Modern litigation productions are dominated by born-digital content including email, Office documents, chat exports and database extracts. The exception is the scanned-document subset: physical paperwork that someone scanned, faxed records, photographs of documents, archived correspondence from before the digital-native era and contracts produced as image-only PDFs.
Scanned documents require OCR to become searchable, reviewable and producible at scale. The OCR step is also where chain of custody is either preserved or destroyed. This guide is for the litigation support manager or paralegal producing scanned document evidence for litigation.
The Production Lifecycle for Scanned Documents
A typical scanned document production runs through six stages:
- Acquisition. Scanned documents arrive from the custodian. These may be physical scans from a copier, photos taken on a phone, faxed records or pre-existing image-only PDFs in archive directories.
- Forensic intake. Each source file is hashed (SHA-256) at receipt and the chain of possession is documented: the custodian, the date, the source media and the examiner identity are all recorded.
- OCR extraction. Text content is extracted from each source image. Per-page hashes are computed and tied back to the source file hash.
- Review. Paralegals or contract reviewers read the OCR-extracted content for relevance, responsiveness and privilege. Tags applied per the review protocol.
- Redaction and Bates stamping. Privileged content redacted. Bates numbers applied per the production protocol.
- Production. Bates-stamped redacted PDF set delivered to opposing counsel with a load file mapping each PDF to its metadata fields.
Each stage has discrete chain-of-custody obligations. The OCR stage is where the documentation diverges between non-evidentiary and evidentiary workflows.
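As a rough illustration of the forensic intake stage, the sketch below hashes a received file with SHA-256 and builds one chain-of-possession entry. It uses only the Python standard library. The function names and record fields (`intake_record`, `source_media` and so on) are assumptions made for this example, not a required schema or any tool's actual format.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def intake_record(path: Path, custodian: str, source_media: str, examiner: str) -> dict:
    """Build one chain-of-possession entry (field names are illustrative)."""
    return {
        "file_name": path.name,
        "sha256": sha256_file(path),
        "size_bytes": path.stat().st_size,
        "custodian": custodian,
        "source_media": source_media,
        "examiner": examiner,
        "received_utc": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    # Placeholder bytes stand in for a real scanned file.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "scan_0001.pdf"
        sample.write_bytes(b"placeholder bytes standing in for a scan")
        record = intake_record(sample, "J. Doe", "USB drive", "A. Examiner")
        print(json.dumps(record, indent=2))
```

Reading in chunks keeps memory use flat for large scan sets. Recording the timestamp in UTC avoids ambiguity when the log is examined later in a different time zone.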
The Chain of Custody Gap at the OCR Stage
In a non-evidentiary workflow, OCR is treated as a format conversion where the source image becomes searchable text without further documentation. For internal information retrieval that is fine. For litigation production it creates an evidentiary gap.
The gap: a reviewer reading OCR-extracted text in the review platform cannot verify the text accurately represents the source image without re-running OCR. The text might contain misreadings, the page might be partial or the formatting might have been lost. If the production is later relied upon in deposition or motion practice, opposing counsel can challenge the authenticity of the extracted text and the producing party has to defend the OCR methodology after the fact.
A defensible OCR workflow closes the gap by hashing both the source and the output at extraction time. Each extracted page carries a SHA-256 of the source image and a SHA-256 of the OCR output. A reviewer or opposing expert can verify the chain at any later point by re-hashing the source image and the extracted text and confirming that both hashes match the chain-of-custody documentation.
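A minimal sketch of that per-page record, and of the later re-hash check, might look like the following. The `PageRecord` structure and its field names are hypothetical. The sketch also assumes the OCR text is hashed as UTF-8 bytes with no normalization, a choice the workflow would need to document so that a later verifier hashes the text the same way.

```python
import hashlib
from dataclasses import dataclass


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class PageRecord:
    """Hashes captured for one page at extraction time (illustrative fields)."""
    source_file_sha256: str  # ties the page back to the intake hash
    page_number: int
    image_sha256: str
    text_sha256: str


def record_page(source_file_sha256: str, page_number: int,
                image_bytes: bytes, ocr_text: str) -> PageRecord:
    return PageRecord(
        source_file_sha256=source_file_sha256,
        page_number=page_number,
        image_sha256=sha256_hex(image_bytes),
        text_sha256=sha256_hex(ocr_text.encode("utf-8")),
    )


def verify_page(record: PageRecord, image_bytes: bytes, ocr_text: str) -> dict:
    """Re-hash the image and text and compare against the stored record."""
    return {
        "image_matches": sha256_hex(image_bytes) == record.image_sha256,
        "text_matches": sha256_hex(ocr_text.encode("utf-8")) == record.text_sha256,
    }


if __name__ == "__main__":
    image = b"placeholder bytes for a page image"
    rec = record_page(sha256_hex(b"whole source file"), 1, image, "Invoice total: 1,200")
    print(verify_page(rec, image, "Invoice total: 1,200"))
    # A single altered character in the text breaks the text hash only.
    print(verify_page(rec, image, "Invoice total: 7,200"))
```

Keeping the image hash and the text hash separate lets a verifier tell whether a mismatch comes from a changed source image or from changed extracted text.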
The Sherlock OCR Reader Workflow for Litigation Production
The workflow that pairs cleanly with a review-platform production pipeline:
- Source intake. Hash each source file at receipt. Document chain of possession.
- Open in Sherlock OCR Reader Forensic Edition. The tool identifies pages requiring OCR vs pages with embedded text.
- Run the OCR pass. Per-character confidence scoring computed at extraction. Per-page SHA-256 computed and stored.
- Review the extraction. Pages with low overall confidence flagged for human verification. Examiner corrections logged.
- Generate the forensic PDF report. Branded report with source file metadata, per-page extraction summary, confidence statistics, examiner attestation and chain-of-custody footer.
- Export the searchable PDF. Output PDF with embedded text matching source image positions. Bates-stampable in the downstream review platform.
- Production set assembly. Source image plus searchable PDF plus forensic PDF report plus signed JSON sidecar with per-page hashes.
- Review platform ingestion. The searchable PDF imports into Relativity, Logikcull, Concordance, Reveal or Everlaw with the OCR text mapped to extracted-text fields. The forensic PDF and JSON sidecar archive in the production-documentation directory.
- Bates stamping in the review platform. Standard Bates numbering applied to the searchable PDF set during the review process. The Bates stamp modifies the production-set PDF only; it does not touch the source image or the chain-of-custody documentation.
- Production delivery. Bates-stamped PDF set plus load file (mapping each PDF to its metadata) plus production cover letter delivered to opposing counsel. The forensic PDF report and signed JSON sidecar archive on the producing party's side for any future authentication challenge.
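The confidence-review step in the workflow above can be sketched as a simple threshold check. The guide does not describe how Sherlock exposes its per-character scores, so the input shape (a mapping of page number to a list of confidences between 0 and 1) and the 0.90 cutoff are assumptions for illustration only.

```python
from statistics import mean

REVIEW_THRESHOLD = 0.90  # hypothetical cutoff; set per the review protocol


def page_confidence(char_confidences: list[float]) -> float:
    """Mean per-character confidence; a page with no characters scores 0.0."""
    return mean(char_confidences) if char_confidences else 0.0


def pages_needing_review(pages: dict[int, list[float]],
                         threshold: float = REVIEW_THRESHOLD) -> list[int]:
    """Return page numbers whose mean confidence falls below the threshold."""
    return sorted(
        number for number, confidences in pages.items()
        if page_confidence(confidences) < threshold
    )


if __name__ == "__main__":
    sample = {
        1: [0.99, 0.98, 0.97],   # clean page
        2: [0.95, 0.60, 0.70],   # degraded fax
        3: [],                   # no text recognized
    }
    print(pages_needing_review(sample))  # [2, 3]
```

Scoring an empty page as 0.0 is a deliberate choice in this sketch: a page where OCR found nothing is routed to a human rather than silently passed.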
What Goes in the Load File
A production load file typically maps each PDF to:
- Bates begin number
- Bates end number
- Custodian
- Document date
- Source file path (within the production)
- Extracted text path (the .txt file with the OCR output, if exported separately)
- Per-document hash (historically SHA-1, SHA-256 in modern productions)
Sherlock's per-page SHA-256 hashes can populate the load file's hash field directly. The signed JSON sidecar provides the authoritative reference for each document's hash with the audit trail showing when the hash was computed and by whom.
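To make the field mapping concrete, here is a hedged sketch that assigns sequential Bates ranges and writes a plain CSV load file. Review platforms define their own load file formats and delimiters, so the CSV layout, the column names and the one-Bates-number-per-page assumption are all illustrative. The production protocol governs the real layout.

```python
import csv
import io

# Column names are illustrative; use the names the production protocol specifies.
FIELDS = ["BEGBATES", "ENDBATES", "CUSTODIAN", "DOCDATE",
          "SOURCE_PATH", "TEXT_PATH", "SHA256"]


def bates(prefix: str, number: int, width: int = 7) -> str:
    """Format a Bates number such as ABC0000001."""
    return f"{prefix}{number:0{width}d}"


def load_file_rows(docs: list[dict], prefix: str, start: int = 1) -> list[dict]:
    """Assign consecutive Bates ranges, one number per page (an assumption)."""
    rows = []
    next_number = start
    for doc in docs:
        if doc["page_count"] < 1:
            raise ValueError(f"{doc['source_path']}: page_count must be at least 1")
        first = next_number
        last = first + doc["page_count"] - 1
        rows.append({
            "BEGBATES": bates(prefix, first),
            "ENDBATES": bates(prefix, last),
            "CUSTODIAN": doc["custodian"],
            "DOCDATE": doc["doc_date"],
            "SOURCE_PATH": doc["source_path"],
            "TEXT_PATH": doc["text_path"],
            "SHA256": doc["sha256"],
        })
        next_number = last + 1
    return rows


def write_load_file(rows: list[dict], out) -> None:
    """Write the rows as CSV to any text file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    docs = [
        {"page_count": 3, "custodian": "J. Doe", "doc_date": "2009-04-17",
         "source_path": "PDF/0001.pdf", "text_path": "TEXT/0001.txt",
         "sha256": "placeholder-hash-1"},
        {"page_count": 2, "custodian": "J. Doe", "doc_date": "2009-05-02",
         "source_path": "PDF/0002.pdf", "text_path": "TEXT/0002.txt",
         "sha256": "placeholder-hash-2"},
    ]
    buffer = io.StringIO()
    write_load_file(load_file_rows(docs, "ABC"), buffer)
    print(buffer.getvalue())
```

The sample documents and hash values are placeholders. In practice the hash column would be filled from the chain-of-custody record rather than typed by hand.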
When the Production Is Challenged
If opposing counsel challenges the authenticity of the extracted text (common in matters where the producing party has an interest in suppressing specific text content), the defending party needs:
- The original source image, unmodified
- The hash of the source image computed at extraction time
- The OCR-extracted text, with per-character confidence scoring available for review
- The examiner's attestation that the extraction was performed without modification to the source
- The chain-of-custody documentation linking the source image to the extracted text through the cryptographic hash chain
Sherlock OCR Reader Forensic Edition produces the last four items on this list (the second through fifth) in a single workflow. The first item, the unmodified source image, is preserved from the original intake.
The defense to an authentication challenge becomes: "Here is the source image. Here is its hash. Here is the hash recorded at extraction time. Here is the examiner's attestation. The chain matches. The extraction is authentic."
A production without the chain documentation has to construct this defense reactively, often months after the original extraction, using whatever audit trail the producing party can reconstruct. That reconstruction often fails or is successfully challenged. The extracted text then risks exclusion for failure to meet the authentication requirement of FRE 901 or related rules.
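Part of what makes the proactive defense work is that the hash record itself can be shown to be unaltered since extraction. The sketch below illustrates the idea with an HMAC-SHA256 tag over a canonical JSON payload. The guide does not describe how Sherlock's sidecar is actually signed, so the scheme, the field names and the shared key are assumptions. A real deployment would more likely use an asymmetric signature so that a verifier does not need the signing secret.

```python
import hashlib
import hmac
import json


def canonical_bytes(payload: dict) -> bytes:
    """Serialize deterministically so the same payload always hashes the same."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_sidecar(payload: dict, key: bytes) -> dict:
    """Wrap the payload with an HMAC-SHA256 tag (stand-in for a real signature)."""
    tag = hmac.new(key, canonical_bytes(payload), hashlib.sha256).hexdigest()
    return {"payload": payload, "hmac_sha256": tag}


def sidecar_is_intact(sidecar: dict, key: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = hmac.new(key, canonical_bytes(sidecar["payload"]),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sidecar["hmac_sha256"])


if __name__ == "__main__":
    key = b"demo-key-not-for-production"
    payload = {
        "source_file_sha256": hashlib.sha256(b"source file").hexdigest(),
        "examiner": "A. Examiner",
        "pages": [{"page": 1, "image_sha256": hashlib.sha256(b"page 1").hexdigest()}],
    }
    sidecar = sign_sidecar(payload, key)
    print(sidecar_is_intact(sidecar, key))   # True

    # Editing a recorded hash after signing invalidates the tag.
    sidecar["payload"]["pages"][0]["image_sha256"] = "0" * 64
    print(sidecar_is_intact(sidecar, key))   # False
```

Checking the tag first and re-hashing the source image second gives the two-step argument in the quoted defense: the record is unaltered, and the image still matches the record.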
When Sherlock Is the Right Choice for This Workflow
- Litigation productions involving scanned document evidence
- E-discovery vendors producing for client matters where defensibility is required
- In-house legal departments handling internal document productions
- Compliance teams producing for regulatory inquiry
- Forensic consultants engaged by outside counsel for document examination
For these scenarios, the $67 lifetime cost falls below most procurement-review thresholds and pays for itself the first time an authentication challenge is filed or anticipated.
When Adobe Acrobat or Generic OCR Is Sufficient
- Internal document review with no anticipated production scrutiny
- Document workflow that requires Acrobat's editing or signature capabilities alongside OCR
- One-off conversions for personal or internal use
- Productions where the parties have stipulated to the authenticity of the OCR output
For these scenarios, Acrobat or Tesseract handles the work and buying Sherlock would be overspending. The full Acrobat vs Sherlock breakdown is covered in the vendor comparison page.
See Also
- Sherlock Forensics OCR Reader Forensic Edition, product page
- Forensic OCR for Document Evidence Extraction, OCR cluster entry with the full workflow and chain of custody breakdown
- Sherlock Forensics OCR Reader Forensic Edition vs Adobe Acrobat Pro OCR, vendor comparison for buyers evaluating their current OCR tool
- The Mid-Market Digital Forensics Toolkit, cross-cluster meta-hub