Public benchmark
PDF reference extraction should be judged by complete, usable exports.
The benchmark shows how LumaCite performs on real academic PDFs: how many references were selected, whether suspicious rows appeared, whether DOI values looked usable, and whether the output was safe to export into reference managers and writing workflows.
Measured examples
Benchmark cases focus on what users can verify.
Each case reports quality signals a user can see and check directly, without depending on how the extraction works internally.
| Benchmark case | Expected references | LumaCite result | Public quality signal |
|---|---|---|---|
| Genes & Development 2023, Mendez-Dorantes et al. | 195 | 195 references selected; clean export allowed after review checks passed. | Previously difficult because the tail of one reference could appear as an extra row. The current result avoids that failure mode. |
| Large academic PDF with 300+ references | 311 | 311 references selected; fast first result with export-ready review details. | Shows the value of returning a usable list without waiting for every optional metadata field to finish. |
| Two-column or footer-heavy bibliography | Varies | Rows are reviewed for suspicious fragments, missing fields, and export risk. | Useful because two-column text can look complete while still hiding split or merged references. |
| Scanned or image-heavy PDFs | Varies | Usually treated as higher risk; better scanned-PDF support is on the roadmap. | Prevents overpromising. Image-only PDFs need different handling than text-based articles. |
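The count comparison behind the first two rows of the table above can be sketched as follows. This is a minimal illustration, not LumaCite's implementation: the `BenchmarkCase` class and the `count_status` function are hypothetical names, and only the case names and the two counts (195 and 311) come from the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BenchmarkCase:
    """One public benchmark row (hypothetical structure, for illustration)."""
    name: str
    expected: Optional[int]  # None when the expected count varies by document
    selected: int


def count_status(case: BenchmarkCase) -> str:
    """Compare the selected reference count against the expected count."""
    if case.expected is None:
        return "no fixed expectation"
    diff = case.selected - case.expected
    if diff == 0:
        return "match"
    return f"{abs(diff)} extra" if diff > 0 else f"{abs(diff)} missing"


cases = [
    BenchmarkCase("Genes & Development 2023, Mendez-Dorantes et al.", 195, 195),
    BenchmarkCase("Large academic PDF with 300+ references", 311, 311),
]

for case in cases:
    print(f"{case.name}: {count_status(case)}")
```

A result of 196 against an expected 195 would report "1 extra", which is the shape of the failure described in the first case, where a reference tail appears as its own row.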
Benchmark method
We measure the workflow, not just the extracted text.
Raw extracted text is not enough. A useful benchmark asks whether the resulting references are complete, reviewable, exportable, and safe enough for a real research workflow.
| Accuracy signal | What users see | Why it matters |
|---|---|---|
| Expected count match | Selected reference count and count warnings. | Catches missing, extra, merged, or fragmented rows before import. |
| Suspicious fragment detection | Rows that look incomplete are flagged for review. | A journal/page/DOI tail should not become its own citation. |
| Identifier quality | Malformed DOI or missing identifier warnings. | Broken identifiers create bad metadata and painful cleanup later. |
| Export decision | Clean export, high-risk export, or review-needed messaging. | Users know whether to download immediately or inspect the list first. |
| Audit report | Downloadable record of counts, warnings, and extraction review signals. | Useful for librarians, research assistants, editors, and systematic review teams. |
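One way to picture how the signals in the table above could combine into an export decision and an audit record is sketched below. The field names, the ordering of the rules, and the report layout are all assumptions made for illustration; only the three message labels and the signal categories come from the table.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ReviewSignals:
    """Signals from the table above (hypothetical field names)."""
    expected_count: int
    selected_count: int
    suspicious_rows: int  # rows flagged as possible fragments
    malformed_dois: int   # identifiers that failed a shape check


def export_decision(signals: ReviewSignals) -> str:
    """Map review signals to one of three export messages.

    The rule ordering is an assumption: structural problems (count
    mismatch, fragments) are treated as more serious than identifier
    problems.
    """
    count_mismatch = signals.selected_count != signals.expected_count
    if count_mismatch or signals.suspicious_rows > 0:
        return "high-risk export"
    if signals.malformed_dois > 0:
        return "review needed"
    return "clean export"


def audit_report(signals: ReviewSignals) -> dict:
    """Bundle counts, warnings, and the decision into one record."""
    return {"signals": asdict(signals), "decision": export_decision(signals)}


# Counts taken from the first benchmark case; zero warnings assumed.
print(json.dumps(audit_report(ReviewSignals(195, 195, 0, 0)), indent=2))
```

Keeping the decision as a pure function of the recorded signals means the audit report always shows the inputs that produced the message a user saw.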
Benchmark roadmap
More public cases will make the accuracy story stronger.
The next benchmark expansion should include more publishers, more citation styles, scanned PDFs, long bibliographies, and intentionally difficult layout cases.
Add representative papers from major publisher styles.
Show how review warnings catch rows that would have exported incorrectly.
Separate scanned PDFs from normal text-based PDF benchmarks.
Publish sample audit reports users can inspect before trying the tool.
Benchmark questions
Why benchmark PDF reference extraction at all?
Because a bibliography can look clean while still missing entries, adding fragments, or carrying broken DOI values. Benchmarks make those risks visible.
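As a rough illustration of what a "broken DOI value" check might look like, the sketch below tests only the shape of an identifier. The pattern is a commonly used loose check for modern DOIs and is an assumption here, not LumaCite's method; some older DOIs fall outside it, and a value that passes is not guaranteed to resolve. The function name `doi_warning` and the sample values are hypothetical.

```python
import re
from typing import Optional

# Loose shape check: "10.", a 4-9 digit registrant code, "/", then a suffix.
# This tests form only; it does not confirm that the DOI resolves.
DOI_SHAPE = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Z0-9]+$", re.IGNORECASE)


def doi_warning(value: Optional[str]) -> Optional[str]:
    """Return a warning label, or None if the value looks usable."""
    if value is None or not value.strip():
        return "missing identifier"
    if not DOI_SHAPE.match(value.strip()):
        return "malformed DOI"
    return None


# Invented sample values, not taken from any benchmark PDF.
for sample in ["10.1234/example.2023.001", "10.1234/", ""]:
    print(repr(sample), "->", doi_warning(sample))
```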
Does one accurate PDF prove the extractor is perfect?
No. One case proves one failure mode was measured. A good benchmark grows over time and keeps adding hard real-world PDFs.
What is the most important benchmark number?
Reference count is the first checkpoint, but not the only one. A count can match while some fields still need review.
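A small sketch of that situation: the count matches, yet one row still needs review. The required-field list, the function name, and the sample rows are all invented for illustration and do not describe LumaCite's checks.

```python
REQUIRED_FIELDS = ("authors", "title", "year")  # assumed minimum set


def rows_needing_review(references):
    """Return (row_number, missing_fields) for rows with empty required fields."""
    flagged = []
    for number, ref in enumerate(references, start=1):
        missing = [field for field in REQUIRED_FIELDS if not ref.get(field)]
        if missing:
            flagged.append((number, missing))
    return flagged


# Invented sample rows, not taken from any benchmark PDF.
references = [
    {"authors": "Doe J", "title": "An example title", "year": "2020"},
    {"authors": "Roe R", "title": "", "year": "2019"},
]
expected_count = 2

count_matches = len(references) == expected_count
flagged = rows_needing_review(references)
print(count_matches, flagged)  # True [(2, ['title'])]
```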
Why include export safety?
The user goal is not merely extraction. The goal is safe import into a writing or reference-management workflow.
Will the benchmark include competitor results?
It can later, but the first priority is a transparent LumaCite benchmark that measures its own output honestly.