How do I produce scanned documents for litigation?

A litigation production of scanned documents runs through six stages: acquisition from the custodian; forensic intake with SHA-256 hashing at receipt; OCR extraction with per-page hashes linking source to extracted text; paralegal review for relevance and privilege; redaction and Bates stamping per the production protocol; and delivery of the Bates-stamped, redacted PDF set with a load file mapping each PDF to its metadata fields. The OCR stage is where chain of custody is either preserved or destroyed. A defensible workflow uses an OCR tool that produces the per-source SHA-256, per-page hashes, examiner attestation and forensic PDF report as workflow outputs, rather than constructing those artifacts separately after the fact.

Does Bates stamping affect chain of custody?

Bates stamping applied in the review platform during production assembly does not affect the chain of custody on the source image or the OCR extraction. The Bates stamp modifies the production-set PDF only. The source image hash is computed at intake and the OCR extraction hash at extraction time, both before the Bates stamp is applied. The forensic chain anchors to the unmodified source artifact. The production-set PDF carries the Bates stamp as a downstream artifact whose provenance is the OCR output. If opposing counsel challenges the source-to-Bates-stamp chain, the producing party can demonstrate the hash chain from source image through OCR extraction to the final Bates-stamped PDF using the signed JSON sidecar from the forensic OCR workflow.

What load file fields do I need for OCR-extracted documents?

A production load file for OCR-extracted scanned documents typically maps each PDF to Bates begin number, Bates end number, custodian, document date, source file path within the production, extracted text path (the .txt file with OCR output if exported separately) and per-document hash (typically SHA-1 historically or SHA-256 in modern productions). Sherlock Forensics OCR Reader Forensic Edition computes per-page SHA-256 as part of the OCR workflow and these hashes populate the load file hash field directly. The signed JSON sidecar provides the authoritative reference for each document's hash with the audit trail showing when the hash was computed and by whom.

Can OCR text be excluded under FRE 901?

OCR-extracted text can be excluded under Federal Rule of Evidence 901 (authentication and identification) when the producing party cannot demonstrate that the extracted text is what the producing party claims it is. The authentication challenge typically targets the OCR methodology: how do we know this text accurately represents the source image? A defensible production answers the challenge with the source image, the source image hash computed at extraction time, the extracted text with per-character confidence scoring, the examiner attestation and the chain-of-custody documentation linking source to extraction through the cryptographic hash chain. A production without this documentation has to construct the defense reactively, often months after extraction, using whatever audit trail can be reconstructed. Reactive reconstruction often fails or is challenged successfully, and the production can then be excluded.

Scanned Document Production for Litigation: A Practical Guide

Modern litigation productions are dominated by born-digital content including email, Office documents, chat exports and database extracts. The exception is the scanned-document subset: physical paperwork that someone scanned, faxed records, photographs of documents, archived correspondence from before the digital-native era and contracts produced as image-only PDFs.

Scanned documents require OCR to become searchable, reviewable and producible at scale. The OCR step is also where chain of custody is either preserved or destroyed. This guide is for the litigation support manager or paralegal producing scanned document evidence for litigation.

The Production Lifecycle for Scanned Documents

A typical scanned document production runs through six stages:

Acquisition. Scanned documents arrive from the custodian. These may be physical scans from a copier, photos taken on a phone, faxed records or pre-existing image-only PDFs in archive directories.
Forensic intake. Each source file is hashed (SHA-256) at receipt with chain of possession documented. The custodian, the date, the source media and the examiner identity all recorded.
OCR extraction. Text content is extracted from each source image. Per-page hashes are computed and tied back to the source file hash.
Review. Paralegals or contract reviewers read the OCR-extracted content for relevance, responsiveness and privilege. Tags applied per the review protocol.
Redaction and Bates stamping. Privileged content redacted. Bates numbers applied per the production protocol.
Production. Bates-stamped redacted PDF set delivered to opposing counsel with a load file mapping each PDF to its metadata fields.

Each stage has discrete chain-of-custody obligations. The OCR stage is where the documentation diverges between non-evidentiary and evidentiary workflows.
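
The forensic intake stage can be illustrated with a short sketch. This is a minimal example assuming Python and a hypothetical IntakeRecord layout; the field names and the record_intake helper are illustrative assumptions, not part of any tool described in this guide. It hashes a source file with SHA-256 at receipt and records the custodian, source media, examiner and receipt time alongside the hash.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass(frozen=True)
class IntakeRecord:
    # Hypothetical record layout; adapt the fields to the matter's protocol.
    source_path: str
    sha256: str
    custodian: str
    source_media: str
    examiner: str
    received_utc: str


def record_intake(path: Path, custodian: str, source_media: str,
                  examiner: str) -> IntakeRecord:
    """Hash the file at receipt and capture who, what and when."""
    return IntakeRecord(
        source_path=str(path),
        sha256=sha256_file(path),
        custodian=custodian,
        source_media=source_media,
        examiner=examiner,
        received_utc=datetime.now(timezone.utc).isoformat(),
    )
```

The record is frozen so that the values captured at receipt cannot be changed in place later; any correction would have to be a new record.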

The Chain of Custody Gap at the OCR Stage

In a non-evidentiary workflow, OCR is treated as a format conversion where the source image becomes searchable text without further documentation. For internal information retrieval that is fine. For litigation production it creates an evidentiary gap.

The gap: a reviewer reading OCR-extracted text in the review platform cannot verify the text accurately represents the source image without re-running OCR. The text might contain misreadings, the page might be partial or the formatting might have been lost. If the production is later relied upon in deposition or motion practice, opposing counsel can challenge the authenticity of the extracted text and the producing party has to defend the OCR methodology after the fact.

A defensible OCR workflow closes the gap by hashing both the source image and the OCR output at extraction time. Each extracted page carries a SHA-256 of the source image and a SHA-256 of the OCR output. A reviewer or opposing expert can verify the chain at any later point by re-hashing the source image and confirming that the hash matches the chain-of-custody documentation.
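
The per-page link and the later re-hash check might look something like the following. This is a sketch under stated assumptions: the link_page and verify_page names and the dictionary keys are hypothetical, and the OCR text is assumed to be hashed as UTF-8 bytes. A real tool may hash a different serialization of its output.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def link_page(source_image: bytes, ocr_text: str) -> dict:
    """Record both hashes for one page at extraction time."""
    return {
        "source_sha256": sha256_hex(source_image),
        # Assumption: the OCR output is hashed as UTF-8 encoded text.
        "ocr_sha256": sha256_hex(ocr_text.encode("utf-8")),
    }


def verify_page(source_image: bytes, ocr_text: str, recorded: dict) -> list[str]:
    """Re-hash the page and return the names of any links that no longer match."""
    current = link_page(source_image, ocr_text)
    return [key for key in ("source_sha256", "ocr_sha256")
            if current[key] != recorded.get(key)]
```

An empty list from verify_page means both hashes still match the recorded values; a non-empty list names which side of the chain changed.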

The Sherlock OCR Reader Workflow for Litigation Production

The workflow that pairs cleanly with a review-platform production pipeline:

Source intake. Hash each source file at receipt. Document chain of possession.
Open in Sherlock OCR Reader Forensic Edition. The tool identifies pages requiring OCR vs pages with embedded text.
Run the OCR pass. Per-character confidence scoring computed at extraction. Per-page SHA-256 computed and stored.
Review the extraction. Pages with low overall confidence flagged for human verification. Examiner corrections logged.
Generate the forensic PDF report. Branded report with source file metadata, per-page extraction summary, confidence statistics, examiner attestation and chain-of-custody footer.
Export the searchable PDF. Output PDF with embedded text matching source image positions. Bates-stampable in the downstream review platform.
Production set assembly. Source image plus searchable PDF plus forensic PDF report plus signed JSON sidecar with per-page hashes.
Review platform ingestion. The searchable PDF imports into Relativity, Logikcull, Concordance, Reveal or Everlaw with the OCR text mapped to extracted-text fields. The forensic PDF and JSON sidecar archive in the production-documentation directory.
Bates stamping in the review platform. Standard Bates numbering applied to the searchable PDF set during the review process. The Bates stamp does not modify the source image or the chain-of-custody documentation. The Bates stamp modifies the production-set PDF only.
Production delivery. Bates-stamped PDF set plus load file (mapping each PDF to its metadata) plus production cover letter delivered to opposing counsel. The forensic PDF report and signed JSON sidecar archive on the producing party's side for any future authentication challenge.
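
To make the idea of a signed sidecar with per-page hashes concrete, here is a rough sketch. The actual sidecar format and signing scheme used by Sherlock OCR Reader are not described in this guide, so everything below is an assumption: the field names are hypothetical, and an HMAC-SHA256 over canonical JSON stands in for whatever signature the real tool applies.

```python
import hashlib
import hmac
import json


def canonical_bytes(payload: dict) -> bytes:
    """Serialize deterministically so the same payload always signs the same."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def build_sidecar(source_sha256: str, page_hashes: list[str],
                  examiner: str, key: bytes) -> dict:
    # Hypothetical layout, not the tool's actual sidecar schema.
    payload = {
        "source_sha256": source_sha256,
        "pages": [{"page": i, "sha256": h}
                  for i, h in enumerate(page_hashes, start=1)],
        "examiner": examiner,
    }
    signature = hmac.new(key, canonical_bytes(payload), hashlib.sha256).hexdigest()
    return {"payload": payload, "hmac_sha256": signature}


def sidecar_is_intact(sidecar: dict, key: bytes) -> bool:
    """Recompute the HMAC and compare it in constant time."""
    expected = hmac.new(key, canonical_bytes(sidecar["payload"]),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sidecar["hmac_sha256"])
```

An HMAC only proves integrity to someone who holds the shared key. A production that must be verifiable by an opposing expert would more plausibly rely on a public-key signature, which this sketch does not attempt.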

What Goes in the Load File

A production load file typically maps each PDF to:

Bates begin number
Bates end number
Custodian
Document date
Source file path (within the production)
Extracted text path (the .txt file with the OCR output, if exported separately)
Per-document hash (typically SHA-1 historically or SHA-256 in modern productions)

Sherlock's per-page SHA-256 hashes can populate the load file's hash field directly. The signed JSON sidecar provides the authoritative reference for each document's hash with the audit trail showing when the hash was computed and by whom.
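
The field list above can be sketched as a small load file builder. This is illustrative only: the column names, the seven-digit Bates width and the CSV layout are assumptions, and the delimiters and field names for a real production are dictated by the production protocol and the receiving review platform.

```python
import csv
import io

# Hypothetical column names; the production protocol sets the real ones.
FIELDS = ["BegBates", "EndBates", "Custodian", "DocDate",
          "SourcePath", "TextPath", "SHA256"]


def bates_range(prefix: str, start: int, page_count: int, width: int = 7):
    """Return (begin, end, next_start) for a document of page_count pages."""
    if page_count < 1:
        raise ValueError("a document must have at least one page")
    begin = f"{prefix}{start:0{width}d}"
    end = f"{prefix}{start + page_count - 1:0{width}d}"
    return begin, end, start + page_count


def build_load_file(docs: list[dict], prefix: str, start: int = 1) -> str:
    """Assign consecutive Bates ranges and emit one row per document."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    next_number = start
    for doc in docs:
        begin, end, next_number = bates_range(prefix, next_number, doc["pages"])
        writer.writerow({
            "BegBates": begin,
            "EndBates": end,
            "Custodian": doc["custodian"],
            "DocDate": doc["doc_date"],
            "SourcePath": doc["source_path"],
            "TextPath": doc["text_path"],
            "SHA256": doc["sha256"],
        })
    return buf.getvalue()
```

Each document's range starts where the previous one ended, so a three-page document followed by a two-page document consumes five consecutive numbers with no gaps.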

When the Production Is Challenged

If opposing counsel challenges the authenticity of the extracted text (common in matters where the challenging party's interest is to suppress specific text content), the defending party needs:

The original source image, unmodified
The hash of the source image computed at extraction time
The OCR-extracted text, with per-character confidence scoring available for review
The examiner's attestation that the extraction was performed without modification to the source
The chain-of-custody documentation linking the source image to the extracted text through the cryptographic hash chain

Sherlock OCR Reader Forensic Edition produces the last four items on this list in a single workflow. The first item, the unmodified source image, is preserved from the original intake.

The defense to an authentication challenge becomes: "Here is the source image. Here is its hash. Here is the hash recorded at extraction time. Here is the examiner's attestation. The chain matches. The extraction is authentic."

A production without the chain documentation has to construct this defense reactively, often months after the original extraction, using whatever audit trail the producing party can reconstruct. That reconstruction often fails or is challenged successfully. The production can then be excluded under FRE 901 or related authentication rules.

When Sherlock Is the Right Choice for This Workflow

Litigation productions involving scanned document evidence
E-discovery vendors producing for client matters where defensibility is required
In-house legal departments handling internal document productions
Compliance teams producing for regulatory inquiry
Forensic consultants engaged by outside counsel for document examination

For these scenarios, the $67 lifetime cost is below the threshold of any procurement review and pays back the first time an authentication challenge is filed or anticipated.

When Adobe Acrobat or Generic OCR Is Sufficient

Internal document review with no anticipated production scrutiny
Document workflow that requires Acrobat's editing or signature capabilities alongside OCR
One-off conversions for personal or internal use
Productions where the parties have stipulated to the authenticity of the OCR output

For these scenarios, Acrobat or Tesseract handles the work and paying for Sherlock is overspending. The full Acrobat vs Sherlock breakdown is covered in the vendor comparison page.