Public benchmark

PDF reference extraction should be judged by complete, usable exports.

The benchmark shows how LumaCite performs on real academic PDFs: how many references were selected, whether suspicious rows appeared, whether DOI values looked usable, and whether the output was safe to export into reference managers and writing workflows.


Measured examples

Benchmark cases focus on what users can verify.

Public benchmark language should prove value without revealing the internal recipe. These cases describe the visible quality signals a user cares about.

| Benchmark case | Expected references | LumaCite result | Public quality signal |
| --- | --- | --- | --- |
| Genes & Development 2023, Mendez-Dorantes et al. | 195 | 195 references selected; clean export allowed after review checks passed. | Previously difficult because one reference tail could appear as an extra row. The current result avoids that visible failure mode. |
| Large academic PDF with 300+ references | 311 | 311 references selected; fast first result with export-ready review details. | Shows the value of returning a usable list without waiting for every optional metadata field to finish. |
| Two-column or footer-heavy bibliography | Varies | Rows are reviewed for suspicious fragments, missing fields, and export risk. | Useful because two-column text can look complete while still hiding split or merged references. |
| Scanned or image-heavy PDFs | Varies | Usually treated as higher risk; better scanned-PDF support is on the roadmap. | Prevents overpromising. Image-only PDFs need different handling than text-based articles. |
Reference count: Does the output include the full bibliography without extras?
Identifier quality: Do DOI, PMID, arXiv, ISBN, and URL values look usable?
Suspicious rows: Are fragments, merged references, or duplicate-looking rows flagged?
Export safety: Should a user import this file into Zotero, Mendeley, EndNote, or Paperpile?
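
To make "look usable" concrete, here is a minimal sketch of a shape check for a few identifier types. It is an illustration under stated assumptions, not LumaCite's internal logic: the function name `looks_usable`, the patterns, and the example values are hypothetical, and the patterns are deliberately simplified (for example, only the newer arXiv ID format is covered).

```python
import re

# Assumed, simplified shape checks; real identifiers have more variants.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")
PMID_RE = re.compile(r"^\d{1,8}$")
ARXIV_RE = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")  # newer-style arXiv IDs only


def looks_usable(kind: str, value: str) -> bool:
    """Return True when the value has the expected shape for its kind."""
    value = value.strip()
    if kind == "doi":
        # Trailing punctuation usually means sentence text leaked into the DOI.
        return bool(DOI_RE.match(value)) and value[-1] not in ".,;"
    if kind == "pmid":
        return bool(PMID_RE.match(value))
    if kind == "arxiv":
        return bool(ARXIV_RE.match(value))
    # Unknown kinds are sent to review rather than trusted.
    return False


# Made-up example values, for illustration only.
print(looks_usable("doi", "10.1234/example.5678"))   # well-formed shape
print(looks_usable("doi", "10.1234/example.5678."))  # trailing period
```

A shape check like this only says whether a value is worth keeping or flagging. It does not confirm that the identifier resolves to the right paper.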

Benchmark method

We measure the workflow, not just the extracted text.

Raw extracted text is not enough. A useful benchmark asks whether the resulting references are complete, reviewable, exportable, and safe enough for a real research workflow.

Count check: Compare selected rows against expected bibliography size when available.
Row-quality check: Look for fragments, repeated starts, broken endings, and suspicious short rows.
Identifier check: Flag malformed DOI-like strings and missing identifiers where review is useful.
Export check: Confirm the result is suitable for BibTeX, RIS, CSL-JSON, CSV, or audit export.
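
The count check and the row-quality check can be sketched in a few lines. This is a hedged illustration, not the production implementation: the function names, the flag labels, the 40-character threshold, and the 30-character prefix comparison are all assumptions chosen for the example.

```python
import re
from typing import List, Optional

MIN_ROW_CHARS = 40  # assumed threshold for a "suspicious short row"

# Assumed patterns for rows that begin like a DOI, URL, or page-range tail.
FRAGMENT_START_RE = re.compile(
    r"^(doi:|https?://|pp?\.\s*\d|\d+[-\u2013]\d+)", re.IGNORECASE
)


def count_check(selected: int, expected: Optional[int]) -> str:
    """Compare the selected row count with the expected bibliography size."""
    if expected is None:
        return "unknown"  # no expected size available for this PDF
    if selected == expected:
        return "match"
    return "too_few" if selected < expected else "too_many"


def row_flags(row: str, previous: Optional[str] = None) -> List[str]:
    """Return review flags for one extracted reference row."""
    flags = []
    text = row.strip()
    if len(text) < MIN_ROW_CHARS:
        flags.append("short_row")
    if FRAGMENT_START_RE.match(text):
        flags.append("fragment_start")
    if text.endswith(("-", ",", " and", "&")):
        flags.append("broken_ending")
    # A row that opens exactly like its predecessor may be a duplicate or a split.
    if previous is not None and text[:30] == previous.strip()[:30]:
        flags.append("repeated_start")
    return flags


print(count_check(196, 195))         # one extra row compared with expected
print(row_flags("doi:10.1234/abc"))  # a DOI tail that became its own row
```

An extra row and a DOI-tail fragment often show up together, which is why the two checks are run side by side: the count says something is wrong, and the row flags point at where.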
| Accuracy signal | What users see | Why it matters |
| --- | --- | --- |
| Expected count match | Selected reference count and count warnings. | Catches missing, extra, merged, or fragmented rows before import. |
| Suspicious fragment detection | Rows that look incomplete are flagged for review. | A journal/page/DOI tail should not become its own citation. |
| Identifier quality | Malformed DOI or missing identifier warnings. | Broken identifiers create bad metadata and painful cleanup later. |
| Export decision | Clean export, high-risk export, or review-needed messaging. | Users know whether to download immediately or inspect the list first. |
| Audit report | Downloadable record of counts, warnings, and extraction review signals. | Useful for librarians, research assistants, editors, and systematic review teams. |
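
One way to picture the export decision is as a small rule over the other signals. The sketch below is an assumption made for illustration: the three labels come from the messaging described above, but the `ReviewSignals` fields and the rules that map signals to labels are hypothetical and do not describe how LumaCite actually decides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewSignals:
    """Hypothetical summary of the checks run on one extracted bibliography."""
    selected: int
    expected: Optional[int]  # None when the expected size is not known
    suspicious_rows: int
    malformed_identifiers: int


def export_decision(s: ReviewSignals) -> str:
    """Map review signals to one of the three export messages (assumed rules)."""
    count_mismatch = s.expected is not None and s.selected != s.expected
    if count_mismatch or s.suspicious_rows > 0:
        return "high-risk export"
    if s.malformed_identifiers > 0 or s.expected is None:
        return "review needed"
    return "clean export"


print(export_decision(ReviewSignals(195, 195, 0, 0)))
print(export_decision(ReviewSignals(196, 195, 1, 0)))
```

The ordering is the design choice worth noting: structural problems (wrong count, fragment rows) outrank field-level problems, because a wrong row corrupts an import more than a missing identifier does.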

Benchmark roadmap

More public cases will make the accuracy story stronger.

The next benchmark expansion should include more publishers, more citation styles, scanned PDFs, long bibliographies, and intentionally difficult layout cases.

Next: More journal examples

Add representative papers from major publisher styles.

Next: Before-and-after cases

Show how review warnings catch rows that would have exported incorrectly.

Planned: Scanned-PDF category

Separate scanned PDFs from normal text-based PDF benchmarks.

Planned: Downloadable audit samples

Publish sample audit reports users can inspect before trying the tool.

Benchmark questions

Why benchmark PDF reference extraction at all?

Because a bibliography can look clean while still missing entries, adding fragments, or carrying broken DOI values. Benchmarks make those risks visible.

Does one accurate PDF prove the extractor is perfect?

No. One case shows only that one failure mode was measured. A good benchmark grows over time and keeps adding hard real-world PDFs.

What is the most important benchmark number?

Reference count is the first checkpoint, but not the only one. A count can match while some fields still need review.

Why include export safety?

The user goal is not merely extraction. The goal is safe import into a writing or reference-management workflow.

Will the benchmark include competitor results?

It can later, but the first priority is a transparent LumaCite benchmark that measures its own output honestly.