OCR-Assisted Document Evidence Triage: A Forensic Examiner Workflow for Scanned PDFs and Image Files

Scanned PDFs and image-based document evidence routinely arrive in forensic engagements: court records from older cases, regulator submissions filed before digital intake was standard and historical paper files preserved through scanning. The documents may carry decisive information, but without a searchable text layer review is impractical until OCR is applied. Sherlock Forensics walks through the end-to-end OCR triage workflow using Sherlock OCR Reader, along with the chain of custody discipline that supports court submission.

The short answer: a five-step workflow. Acquire the scanned document file from the source. Hash the file and document the chain of custody. Run OCR with Sherlock OCR Reader (no external OCR engine required). Apply keyword filters to identify in-scope pages. Export a searchable PDF and a CSV summary for downstream review. Total workflow time is approximately 35 minutes for a typical 200-page document.

Why OCR Triage Still Matters in 2026 Forensic Practice

Born-digital documents dominate modern forensic engagements, but scanned evidence remains common in three recurring contexts. First, historical case files. Civil litigation involving events from before 2015 routinely surfaces documents that were originally paper and were later scanned for storage. The scanned copies may carry decisive evidence, but without searchable text they are impractical to review at scale. Second, regulator submissions. Many Canadian regulator filing systems accept paper and scanned submissions in parallel with digital filings. Forensic investigation into regulator interactions often requires reviewing the scanned submissions to confirm what was actually filed and when. Third, court records. Provincial court record systems vary widely in digital availability. Older court records, particularly from criminal proceedings and from rural court registries, are routinely paper-only and require scanning before forensic analysis.

For investigators handling any of these contexts, the OCR triage workflow is the step that transforms image-based documents into searchable forensic evidence. Sherlock OCR Reader handles the workflow in a single tool with the chain of custody discipline appropriate for litigation contexts.

Step 1: Acquire the Scanned Document File

Forensic acquisition of scanned document evidence follows the standard rules: image the source device when possible, extract the file when full device imaging is not justified, and document the acquisition step in the case log. Scanned documents typically arrive as PDF (image-only PDF or mixed image and text PDF), TIFF (multi-page TIFF is common for court records), JPEG (single-page or as part of a folder) or PNG (less common but seen). The first triage step is confirming whether the document is image-only by attempting text selection on a sample page. If no text is selectable, the document is scanned imagery without an embedded text layer and needs OCR. For mixed PDFs, check several pages, since some pages may carry text while others do not.
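Because file extensions are sometimes wrong, it can help to confirm the container format of each incoming file before the text-selection check. The following is a minimal Python sketch, not a feature of Sherlock OCR Reader; the function names are hypothetical. It compares a file's leading bytes against the well-known signatures for PDF, TIFF, JPEG and PNG, and it does not detect whether a PDF carries a text layer.

```python
# Leading byte signatures for the four formats named above.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"II*\x00", "tiff"),          # little-endian TIFF
    (b"MM\x00*", "tiff"),          # big-endian TIFF
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
]


def sniff_format(header: bytes) -> str:
    """Return the format name for a file header, or 'unknown'.

    Assumption: the signature sits at offset 0. Some PDF readers
    tolerate leading junk before '%PDF-'; this sketch does not.
    """
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def sniff_file(path: str) -> str:
    """Read only the first 16 bytes of a file opened read-only."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(16))
```

Run this against the working copy rather than the original, and record any mismatch between extension and detected format in the case log.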

For documents arriving on physical storage media (CD-ROM, DVD, external drive), the acquisition includes a forensic image of the source media and per-file hash records. For documents arriving via secure file transfer (regulator portal download, encrypted email attachment), the acquisition includes the transfer record and the per-file hash at the moment of receipt. The chain of custody discipline starts at the moment of acquisition regardless of the source channel.

Step 2: Hash and Chain of Custody

Compute the SHA-256 hash of the source document file before any analysis. Record the hash, the acquisition timestamp (UTC and local), the source channel, the examiner identity and the case number in the chain of custody log. Document the working-copy hash separately if you are operating on a copy. The Sherlock hash verifier produces a signed acquisition record suitable for litigation submission.
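The fields listed above can be sketched as a small record builder. This is an illustrative Python example using only the standard library, not the Sherlock hash verifier; the function names and the dictionary layout are assumptions, and it does not sign the record.

```python
import hashlib
from datetime import datetime, timezone


def sha256_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large scans do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def custody_record(path: str, source_channel: str,
                   examiner: str, case_number: str) -> dict:
    """Build one chain of custody log entry (hypothetical layout)."""
    now_utc = datetime.now(timezone.utc)
    return {
        "file": path,
        "sha256": sha256_file(path),
        "acquired_utc": now_utc.isoformat(),
        # Same instant expressed in the examiner workstation's local zone.
        "acquired_local": now_utc.astimezone().isoformat(),
        "source_channel": source_channel,
        "examiner": examiner,
        "case_number": case_number,
    }
```

Calling `custody_record` once for the source file and once for the working copy gives the two separately documented hashes the step calls for.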

For investigations heading toward Canadian civil or criminal court, the chain of custody documentation needs to satisfy the evidentiary standards in the relevant jurisdiction. The Canada Evidence Act and the provincial evidence acts have specific requirements for digital evidence authentication. The chain of custody for scanned documents is identical in principle to that for other digital evidence, but the workflow steps differ because the OCR transformation produces a derivative artifact (the searchable text layer) that itself needs hash documentation.
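One way to document a derivative artifact is to log its hash alongside the hash of the source it was produced from, after confirming the source still matches its recorded value. The Python sketch below illustrates that idea under stated assumptions: the function name, the `step` label and the entry layout are hypothetical and are not taken from Sherlock tooling or from any evidence statute.

```python
import hashlib


def derivative_entry(source_bytes: bytes, recorded_source_sha256: str,
                     derivative_bytes: bytes, step: str) -> dict:
    """Link a derived artifact (e.g. an OCR text layer) to its source."""
    current = hashlib.sha256(source_bytes).hexdigest()
    # Refuse to log the derivative if the source has changed since acquisition.
    if current != recorded_source_sha256:
        raise ValueError("source no longer matches its recorded hash")
    return {
        "step": step,
        "source_sha256": current,
        "derivative_sha256": hashlib.sha256(derivative_bytes).hexdigest(),
    }
```

The entry makes the transformation traceable in both directions: a reviewer can start from the searchable output and reach the acquired original, or the reverse.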

Step 3: OCR With Sherlock OCR Reader

Open the working copy of the document in Sherlock OCR Reader. The tool launches without requiring an external OCR engine installation (no Tesseract install, no ABBYY install, no Adobe Acrobat Pro license). The first-load operation extracts the OCR text layer using the embedded OCR engine and produces a page-by-page rendered preview alongside the OCR confidence indicators. For multi-page documents (court records typically run 50 to 500 pages, regulator filings sometimes 1000+ pages), the first-pass OCR completes in approximately 2 to 4 minutes per 100 pages on modern forensic hardware.
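The stated throughput of 2 to 4 minutes per 100 pages can be turned into a quick planning estimate. The Python sketch below is simple arithmetic on that figure; the function name is hypothetical and actual timings will depend on hardware and scan quality.

```python
def ocr_minutes(pages: int, minutes_per_100=(2.0, 4.0)) -> tuple:
    """Return (low, high) first-pass OCR estimates in minutes."""
    low, high = minutes_per_100
    return (pages * low / 100, pages * high / 100)


for page_count in (200, 500, 1000):
    low, high = ocr_minutes(page_count)
    print(f"{page_count} pages: {low:.0f} to {high:.0f} minutes")
```

At these rates a 200-page document takes roughly 4 to 8 minutes of OCR time and a 1000-page filing roughly 20 to 40 minutes.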

The OCR engine handles typed text reliably and handwritten content with reduced confidence. The page-by-page confidence indicator surfaces pages where the OCR result is suspect and warrants examiner review. For court records that mix original handwriting and typewritten content on the same page, the examiner reviews the low-confidence pages manually and accepts the high-confidence OCR for the typewritten portions.
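The review split described above amounts to partitioning pages by a confidence threshold. The following Python sketch assumes per-page confidence scores on a 0 to 1 scale and a cut-off of 0.85; both the scale and the threshold are illustrative assumptions, not documented Sherlock OCR Reader values.

```python
def split_by_confidence(page_confidence: dict, threshold: float = 0.85):
    """Return (pages needing manual review, pages accepted as-is)."""
    review = sorted(p for p, c in page_confidence.items() if c < threshold)
    accept = sorted(p for p, c in page_confidence.items() if c >= threshold)
    return review, accept


# Hypothetical scores: pages 2 and 4 contain handwriting.
scores = {1: 0.98, 2: 0.62, 3: 0.85, 4: 0.40}
review, accept = split_by_confidence(scores)
print("manual review:", review)
print("accepted:", accept)
```

Whatever threshold a team adopts should be recorded in the case log so the review decision can be reproduced.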

Step 4: Apply Keyword Filters

Apply keyword search across the OCR text layer to identify the in-scope pages and passages for the investigation. Sherlock OCR Reader supports compound keyword filters across multiple terms with proximity operators. For example, the filter "payment WITHIN 5 words of vendor" surfaces passages where the two terms appear close together, and the filter "acquisition NOT WITHIN same page as approved" surfaces acquisition references on pages that contain no approval language. The filter pane shows matching pages with surrounding context and the per-page hit count.
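To make the two operators concrete, here is a minimal Python sketch of a word-distance filter and a same-page exclusion filter over extracted page text. It illustrates the general technique only; it is not Sherlock OCR Reader's implementation, the function names are hypothetical, and it matches exact lowercase tokens with no stemming (so "payments" would not match "payment").

```python
import re


def tokenize(text: str) -> list:
    """Lowercase word tokens; punctuation is discarded."""
    return re.findall(r"[a-z0-9']+", text.lower())


def within(text: str, term_a: str, term_b: str, max_distance: int) -> list:
    """Token-index pairs where term_a and term_b are at most max_distance apart."""
    tokens = tokenize(text)
    pos_a = [i for i, t in enumerate(tokens) if t == term_a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b.lower()]
    return [(a, b) for a in pos_a for b in pos_b
            if 0 < abs(a - b) <= max_distance]


def pages_within(pages: dict, term_a: str, term_b: str, max_distance: int) -> dict:
    """Map page number to hit count for pages with at least one proximity match."""
    hits = {}
    for number, text in pages.items():
        matches = within(text, term_a, term_b, max_distance)
        if matches:
            hits[number] = len(matches)
    return hits


def pages_with_but_not(pages: dict, term: str, excluded: str) -> list:
    """Pages that contain term and do not contain excluded anywhere."""
    result = []
    for number, text in pages.items():
        tokens = set(tokenize(text))
        if term.lower() in tokens and excluded.lower() not in tokens:
            result.append(number)
    return sorted(result)
```

Given a dictionary of page numbers to OCR text, `pages_within(pages, "payment", "vendor", 5)` returns the per-page hit counts, and `pages_with_but_not(pages, "acquisition", "approved")` returns the pages for the exclusion case.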

For large documents, the filter step delivers most of the workflow's value. A 500-page scanned regulator filing may contain 1000 to 5000 page-level matches across the keyword set; the in-scope subset for a specific investigation might be 20 to 50 pages. Narrowing 500 pages to 20 to 50 cuts the pages needing manual review by a factor of 10 to 25. The exported filter set carries the original page imagery, the OCR text and the case log reference.

Step 5: Export to PDF and CSV

Export the in-scope pages to PDF and the keyword hit summary to CSV. The PDF export preserves the original scanned image fidelity and adds the OCR text layer underneath the image. This produces a searchable PDF that downstream review tooling (Relativity, Logikcull, Disco, Everlaw, Canadian-specialized review platforms) can ingest and search natively. The CSV summary lists the per-keyword hit count, the matched page numbers and the page-level confidence score for each match.
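For teams that post-process or cross-check the summary, the Python sketch below writes a CSV with the fields described above using the standard `csv` module. The column names and the one-row-per-keyword-and-page layout are assumptions for illustration; the actual Sherlock OCR Reader export layout may differ.

```python
import csv
import io

# Hypothetical column layout, one row per keyword and matched page.
FIELDS = ["keyword", "page", "hits", "confidence"]


def write_hit_summary(rows: list, out) -> None:
    """Write keyword hit rows to an open text stream as CSV."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Illustrative rows only; these are not real case data.
example_rows = [
    {"keyword": "payment", "page": 12, "hits": 3, "confidence": 0.97},
    {"keyword": "vendor", "page": 12, "hits": 1, "confidence": 0.97},
]
buffer = io.StringIO()
write_hit_summary(example_rows, buffer)
print(buffer.getvalue())
```

When writing to a real file, open it with `newline=""` as the `csv` module documentation recommends, then hash the finished CSV so the export is covered by the chain of custody log.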

For Canadian civil litigation, the most common downstream formats are searchable PDF for review platform ingest and CSV for evidence inventory. For criminal litigation the same outputs apply, and the chain of custody documentation needs to cover the OCR transformation step explicitly. The Sherlock OCR Reader output is structured to support both downstream contexts without additional examiner work.

The Common Failure Mode This Workflow Prevents

The failure mode the workflow prevents is page-by-page manual review of image-based documents. Investigators who attempt manual review of a 500-page scanned regulator filing typically spend 8 to 12 hours of examiner time and produce inconsistent findings, because a human reviewer cannot reliably keyword-scan across that volume. The OCR triage workflow replaces that effort with a repeatable process that takes approximately 35 minutes for a typical 200-page document and produces a documented chain of custody.

For organizations handling document evidence at scale, the operational discipline is to standardize the OCR triage workflow across the team and apply it as a routine intake step for any scanned document. The Sherlock OCR Reader storefront documents pricing and licensing for organizations that want to build internal capacity. For organizations needing engagement work, Sherlock Forensics applies the same methodology as part of the standard forensic engagement scope.