You scan a stack of paper documents to PDF. The pages look fine — crisp images of the original text. But try to search for a word, copy a paragraph, or highlight a sentence, and nothing happens. The PDF is just a collection of page-sized images. The computer has no idea there are words on those pages.
OCR — optical character recognition — bridges that gap. It analyzes the pixel patterns in each page image, identifies characters and words, and embeds an invisible text layer into the PDF. The visual appearance stays the same, but now the file is searchable, selectable, and indexable. This guide covers how OCR actually works, what tools do it well, where it falls short, and how to get the best results.
What OCR actually does to a PDF
A scanned PDF stores each page as a raster image — typically JPEG or CCITT-compressed TIFF data embedded in the PDF container. There is no text content, no font data, no character encoding. When you "search" this PDF, the viewer has nothing to search through.
OCR processing adds a text layer on top of each page image. The engine identifies every character it can find, determines its position on the page, and writes invisible text objects at the corresponding coordinates. The original image remains untouched beneath. When you search, select, or copy text from the resulting file, you're interacting with this invisible text layer — the visual rendering still comes from the image.
This approach — overlaying text on images — is called "sandwich PDF" or "PDF/A with OCR text." It's the standard output format for every serious OCR tool. The alternative — extracting text and discarding the images — loses formatting and layout information, so it's rarely what you want for document archival.
How OCR engines recognize text
Modern OCR is a multi-stage pipeline. Understanding the stages helps explain why certain documents OCR well and others don't.
1. Preprocessing
Before character recognition begins, the engine cleans up the image. This includes deskewing (straightening tilted scans), binarization (converting to black and white for clearer character boundaries), noise removal (eliminating speckles and artifacts), and sometimes dewarping (flattening curved text from book spines or folded pages). The quality of preprocessing directly affects recognition accuracy — a clean, straight, high-contrast image produces far better results than a dark, skewed, noisy one.
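The binarization step can be sketched in a few lines. The standard approach is Otsu's method, which picks the threshold that best separates dark (ink) pixels from light (paper) pixels. This is a pure-Python toy on a flat list of 8-bit grayscale values; real pipelines use optimized image libraries, but the idea is the same.

```python
def otsu_threshold(pixels):
    """Return the threshold (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    weight_bg = sum_bg = 0
    for t in range(256):
        weight_bg += hist[t]          # pixels at or below threshold t
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        var_between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(pixels, threshold):
    """Map each pixel to 0 (ink) or 255 (paper)."""
    return [0 if p <= threshold else 255 for p in pixels]

# A noisy scan line: ink around 30-60, paper around 200-230.
scan = [30, 45, 60, 210, 220, 200, 55, 215, 230, 40]
t = otsu_threshold(scan)
print(binarize(scan, t))
```

The threshold lands between the two clusters automatically, which is why binarization works well on high-contrast scans and poorly on faded or stained ones: when the ink and paper distributions overlap, no threshold separates them cleanly.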
2. Layout analysis
The engine identifies regions of the page: text blocks, columns, tables, images, headers, footers. This matters because reading order depends on layout. A two-column academic paper needs to be read column by column, not straight across. Tables need their cells identified to preserve structure. Layout analysis is where OCR engines differ most — getting columns and tables right is hard, and mistakes here scramble the output even if every individual character is recognized correctly.
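A toy example makes the reading-order problem concrete. The same text blocks come out in a different order depending on whether the page is treated as one column or two. Blocks are (x, y, text) tuples with y increasing downward; the column split at x=300 is an assumed page midpoint for illustration, not a real engine's API.

```python
def naive_order(blocks):
    """Read straight across the page: top-to-bottom, then left-to-right."""
    return [t for _, _, t in sorted(blocks, key=lambda b: (b[1], b[0]))]

def column_order(blocks, split_x=300):
    """Read the left column fully, then the right column."""
    left = sorted((b for b in blocks if b[0] < split_x), key=lambda b: b[1])
    right = sorted((b for b in blocks if b[0] >= split_x), key=lambda b: b[1])
    return [t for _, _, t in left + right]

blocks = [
    (50, 100, "Left para 1"), (350, 100, "Right para 1"),
    (50, 200, "Left para 2"), (350, 200, "Right para 2"),
]
print(naive_order(blocks))   # interleaves the two columns
print(column_order(blocks))  # correct reading order
```

Every character in both outputs is "recognized" perfectly; only the layout decision differs. That is the sense in which layout mistakes scramble output even when recognition is flawless.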
3. Character recognition
Each text region is segmented into lines, then words, then individual characters. Traditional OCR matched character shapes against known templates. Modern engines — including Tesseract 4 and later — use neural networks (specifically LSTM-based models) trained on millions of text samples. The network processes entire lines at a time rather than individual characters, which improves accuracy because context helps disambiguate similar shapes. Is that an "l", an "I", or a "1"? The surrounding letters provide clues.
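A hand-written caricature of that context effect: a look-alike digit surrounded by letters is probably a letter. Real engines learn such regularities from training data rather than using rules like this, and the confusion table below is illustrative, not from any real system.

```python
# Look-alike pairs common in OCR output (illustrative subset).
CONFUSIONS = {"1": "l", "0": "o", "5": "s"}

def fix_digit_in_word(word):
    """Replace look-alike digits that sit between alphabetic characters."""
    chars = list(word)
    for i in range(1, len(chars) - 1):
        if chars[i] in CONFUSIONS and chars[i-1].isalpha() and chars[i+1].isalpha():
            chars[i] = CONFUSIONS[chars[i]]
    return "".join(chars)

print(fix_digit_in_word("app1e"))  # context says the "1" is an "l"
print(fix_digit_in_word("2021"))   # all-digit context: left alone
```

Note the flip side: the same rule would corrupt genuine mixed strings like serial numbers, which is exactly why dictionary-free text (URLs, codes, part numbers) is where OCR confusions survive.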
4. Post-processing
After recognition, some engines apply dictionary-based correction, confidence scoring, and language model checks. If the engine reads "teh" but the dictionary says that's not a word and "the" is a common one with nearly identical character shapes, it may correct the output. This step is language-dependent — engines with strong language models for a given language perform measurably better on text in that language.
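A minimal sketch of dictionary-based correction, using difflib's similarity ratio as a stand-in for the character-shape-aware scoring a real engine would use. The tiny lexicon and the 0.6 cutoff are illustrative values: the cutoff is there so that words with no near match (proper nouns, codes) pass through untouched.

```python
import difflib

LEXICON = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}

def correct(word):
    """Return the word unchanged if known; else snap to a close lexicon entry."""
    if word.lower() in LEXICON:
        return word
    matches = difflib.get_close_matches(word.lower(), LEXICON, n=1, cutoff=0.6)
    return matches[0] if matches else word

print([correct(w) for w in ["teh", "quick", "brovvn", "fox"]])
```

This is also where the language-dependence comes from: the correction step is only as good as the lexicon and language model behind it.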
Tesseract: the open-source standard
Tesseract is the most widely used open-source OCR engine. Originally developed at HP Labs in the 1980s, it was open-sourced in 2005 and later maintained by Google. Version 4 (released 2018) added LSTM-based neural network recognition, and version 5 is the current release.
Tesseract supports over 100 languages out of the box. You download trained data files for each language you need, and Tesseract uses the appropriate model during recognition. For many languages, multiple model variants exist: a "fast" model optimized for speed and a "best" model optimized for accuracy.
Running Tesseract directly from the command line. Note that Tesseract reads page images, not PDFs, so render each PDF page to an image first (for example with pdftoppm):

tesseract page.tiff output pdf
# For a specific language:
tesseract page.tiff output -l fra pdf
# For multiple languages:
tesseract page.tiff output -l eng+fra+deu pdf

Tesseract's strength is accuracy on clean, well-scanned documents. Its weakness is layout analysis — complex multi-column layouts, tables, and mixed text/image pages can trip it up. For straightforward single-column documents, it's excellent.
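Scripting Tesseract usually means building these same invocations programmatically. A minimal sketch, showing only command construction; actually running the command requires Tesseract on PATH, and the input must be a page image.

```python
import subprocess  # needed only if you uncomment the run() call below

def tesseract_cmd(image_path, output_base, langs=("eng",)):
    """Build a tesseract invocation that produces a searchable PDF."""
    return ["tesseract", image_path, output_base, "-l", "+".join(langs), "pdf"]

cmd = tesseract_cmd("page.tiff", "output", langs=("eng", "fra"))
print(cmd)
# To actually run it (requires Tesseract installed):
# subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a shell string avoids quoting problems with file names containing spaces.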
Other OCR engines worth knowing
Adobe Acrobat
Acrobat Pro's OCR is reliable and handles complex layouts better than Tesseract. It integrates directly into the PDF editing workflow — scan, OCR, and edit in one application. The downside is cost: Acrobat Pro requires a subscription. For occasional use, it's hard to justify. For daily document processing, it earns its keep.
ABBYY FineReader
ABBYY is the gold standard for commercial OCR accuracy, especially on degraded documents, complex layouts, and unusual fonts. Their engine powers many enterprise document processing systems. If you process thousands of documents daily and accuracy on difficult scans matters, ABBYY is worth evaluating. For personal use, it's overkill.
OCRmyPDF
OCRmyPDF is a Python tool that wraps Tesseract and adds smart PDF handling. It detects whether a PDF already has text (skipping OCR on those pages), optimizes the output file size, produces PDF/A-compliant output, and handles page rotation automatically. If you're going to use Tesseract, using it through OCRmyPDF is almost always better than calling Tesseract directly:
ocrmypdf input.pdf output.pdf
# Force OCR even on pages that already have text:
ocrmypdf --force-ocr input.pdf output.pdf
# Specify language:
ocrmypdf -l deu input.pdf output.pdf

Cloud APIs
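For batch work, the OCRmyPDF commands above are easy to generate per file. A sketch that builds one command per PDF in a folder, using the --skip-text flag so pages that already contain text are left alone (useful when a batch mixes scans with born-digital PDFs). Only command construction is shown; running them requires ocrmypdf installed.

```python
from pathlib import Path

def batch_commands(src_dir, dst_dir, lang="eng"):
    """Yield one ocrmypdf command per PDF found in src_dir."""
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        out = Path(dst_dir) / pdf.name
        yield ["ocrmypdf", "--skip-text", "-l", lang, str(pdf), str(out)]

for cmd in batch_commands("scans", "searchable"):
    print(" ".join(cmd))
```

Each command is independent, so a batch like this parallelizes trivially across files if throughput matters.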
Google Cloud Vision, AWS Textract, and Azure AI Document Intelligence all offer OCR as a service. These cloud engines handle layout analysis, table extraction, and handwriting recognition better than most local tools. The tradeoff is privacy: your documents go to a third party's servers. For sensitive documents, that may not be acceptable. For bulk processing of non-sensitive material, cloud APIs are fast and accurate.
Language support and accuracy
OCR accuracy varies dramatically by language and script. Latin-alphabet languages (English, French, German, Spanish) are well-supported by every engine, with accuracy typically above 98% on clean scans. CJK scripts (Chinese, Japanese, Korean) are supported but harder — the character sets are vastly larger, and character density is higher. Arabic, Hebrew, and other right-to-left scripts work but require specific language models and can struggle with connected cursive forms. Handwritten text in any language remains the hardest challenge; even the best engines achieve significantly lower accuracy than on printed text.
For multi-language documents — a French legal document with English citations, for example — you need to specify all relevant languages during processing. Most engines handle this, but accuracy on the secondary language may drop slightly.
Getting the best OCR results
The single biggest factor in OCR quality is scan quality. No engine can reliably read text that a human would squint at. Here are the controllable factors:
- Scan at 300 DPI minimum. 200 DPI is borderline; 150 DPI will produce noticeably more errors. For small text (footnotes, fine print), 400–600 DPI is better. Higher than 600 DPI rarely helps and just increases file size.
- Use grayscale or black-and-white mode. Color scans don't improve OCR accuracy and produce larger files. Grayscale is the default choice; pure black-and-white (1-bit) works well for high-contrast documents and produces the smallest files.
- Keep the glass clean. Smudges, fingerprints, and dust on the scanner glass create artifacts that degrade recognition. This sounds trivial but causes real problems on batch scans.
- Align pages straight. Most OCR engines can correct minor skew, but significant rotation (more than a few degrees) hurts accuracy. Use the scanner's alignment guides.
- Avoid heavy compression on the scan. Aggressive JPEG compression introduces artifacts around character edges — exactly where the OCR engine needs sharp boundaries. Use moderate compression or lossless formats for the scan stage; you can compress the final OCR'd PDF later.
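The DPI guidance above, in numbers: pixel dimensions and approximate raw grayscale size for a US Letter page (8.5 x 11 in) at common scan resolutions. Raw size is before compression, so real files are much smaller; the point is how quickly pixel counts grow with DPI.

```python
def scan_dimensions(dpi, width_in=8.5, height_in=11.0):
    """Return (width_px, height_px, raw_megabytes) for a page scan."""
    w, h = round(width_in * dpi), round(height_in * dpi)
    raw_mb = w * h / 1_000_000  # 1 byte per pixel for 8-bit grayscale
    return w, h, raw_mb

for dpi in (150, 300, 600):
    w, h, mb = scan_dimensions(dpi)
    print(f"{dpi} DPI: {w} x {h} px, ~{mb:.1f} MB raw grayscale")
```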
When OCR goes wrong
OCR is not perfect, and knowing its failure modes helps you decide when to trust the output and when to proofread.
- Similar-looking characters: "rn" vs "m", "cl" vs "d", "0" vs "O". Context helps, but proper nouns, URLs, and code snippets don't benefit from dictionary correction.
- Tables and forms: Cell boundaries may not be detected correctly, causing text from adjacent cells to merge or appear in the wrong order. Tables with no visible borders are especially problematic.
- Watermarks and background patterns: Text overlaid on colored backgrounds, watermarks, or background images is harder to isolate and recognize.
- Mixed content: Pages with a combination of text, images, diagrams, and captions require accurate layout analysis. Mistakes in layout segmentation cascade into garbled output.
- Low-quality originals: Faded ink, photocopy-of-a-photocopy degradation, crumpled or stained pages — these defeat preprocessing and degrade recognition. There's a floor below which no software can compensate for bad input.
Using MakeMyPDF for OCR
Our OCR PDF tool processes scanned PDFs through a server-side OCR engine. Upload your scanned PDF, select the document language, and the tool returns a searchable PDF with an embedded text layer. The original page images are preserved — you get a sandwich PDF that looks identical to the input but supports search, copy, and text selection.
OCR requires server-side processing because the recognition engine and trained language models are too large to run in a browser. This means your file is uploaded for processing, unlike our client-side tools that keep everything local. The file is processed and deleted immediately — it is not stored or retained after the OCR'd result is returned to you.
PDF/A and long-term archival
If you're scanning documents for archival, consider producing PDF/A output. PDF/A is an ISO-standardized subset of PDF designed for long-term preservation. It requires that all fonts be embedded, forbids encryption and external dependencies, and ensures the document is self-contained and renderable decades from now.
OCRmyPDF produces PDF/A output by default. Most commercial OCR tools offer it as an option. For legal, medical, or government document archival, PDF/A compliance may be a regulatory requirement — check before you set up a scanning workflow.
FAQ
How can I tell if a PDF has already been OCR'd?
Open the PDF and try to select text. If you can highlight individual words and copy them to your clipboard, the PDF already has a text layer — either because it was created digitally or because OCR has already been applied. If clicking and dragging selects the entire page as an image (or selects nothing), the PDF is image-only and needs OCR.
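For checking many files at once, a crude programmatic version of the "try to select text" test is to scan the raw PDF bytes for font resources: a PDF with no /Font objects has no text to select. This is a rough heuristic sketch, not a real parser — fonts inside compressed object streams would be missed, so treat a negative result as "probably image-only" rather than proof.

```python
def probably_has_text_layer(pdf_path):
    """Heuristic: a PDF with no /Font resources has no selectable text."""
    with open(pdf_path, "rb") as f:
        data = f.read()
    return b"/Font" in data

# Example (path is illustrative):
# print(probably_has_text_layer("scanned.pdf"))
```

A proper check would use a PDF library to walk each page's resources, but for triaging a scan folder the heuristic is often enough.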
Does OCR change how the PDF looks?
No. The sandwich PDF approach preserves the original page images exactly as they are. The text layer is invisible — it sits behind the images and is only used for search and selection. The visual appearance of every page remains unchanged.
How accurate is OCR on typical business documents?
On clean, 300 DPI scans of standard printed text in English or other Latin-script languages, modern OCR engines achieve 98–99% character accuracy. That sounds high, but on a page with 3,000 characters, 1% error means 30 wrong characters — enough to cause problems if you're relying on the text for legal or financial purposes. Always proofread OCR output on critical documents.
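If you have a known-good transcript of a sample page, you can measure that accuracy yourself. A sketch using difflib's similarity ratio as a simple stand-in for the usual edit-distance-based character error rate; the sample strings are invented for illustration.

```python
import difflib

def char_accuracy(ocr_text, truth):
    """Rough character-level accuracy of OCR output against ground truth."""
    return difflib.SequenceMatcher(None, ocr_text, truth).ratio()

truth = "The quick brown fox jumps over the lazy dog."
ocr = "The quick brovvn fox jumps over the 1azy dog."  # two typical OCR confusions
print(f"{char_accuracy(ocr, truth):.1%}")
```

Running a measurement like this on a few representative pages before committing to a scanning workflow tells you whether your scan settings are good enough for your accuracy target.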
Can OCR handle handwritten text?
Printed handwriting (like block capital letters on a form) works reasonably well. Cursive handwriting is much harder — accuracy drops significantly, and results vary widely depending on handwriting legibility. Cloud APIs (especially Google and Azure) handle handwriting better than local tools, but none are reliable enough to skip manual review.
What about OCR on photos instead of scans?
Phone camera captures work, but quality matters more than with flatbed scans. Uneven lighting, perspective distortion, motion blur, and curved pages from books all reduce accuracy. If you're using a phone camera as a scanner, use a dedicated scanning app that applies perspective correction and contrast enhancement before OCR. Our Scan to PDF tool captures photos and converts them to PDF format, which you can then run through OCR processing.