Optical character recognition, usually abbreviated to OCR, is a method of converting scanned images of printed or handwritten text into machine-readable text. It is widely used to convert books and documents into electronic files, or to publish their text on a website.
What's OCR used for?
The most common situation in which people need OCR is this: a document has been scanned and saved as a PDF or image file. You can view it, but you cannot search for a specific word or phrase inside it, and the computer cannot recognize the words it contains. Why is a scanned file not searchable? Because the file is a picture of the page: its text is stored as pixels rather than as characters, so it can be neither searched nor edited.
With OCR software, this is a piece of cake: you can easily convert scanned files into a searchable, editable text format.
The limitations of OCR
No OCR system has ever achieved a 100% read rate. How accurately an OCR device reads, without substituting characters, is not solely the responsibility of the hardware manufacturer; much depends on the quality of the items being processed. The list below highlights some common scenarios that OCR finds problematic:
- - Images containing very small text (smaller than 10 points).
- - Images scanned from stained, crumpled, or colored paper.
- - Low quality images with grainy or faded text.
- - Images with skewed or warped text.
- - Images with mixed content (text, images, and graphics all on one page).
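The small-text limit is easier to judge in pixels than in points. As a rough sketch, assuming the standard definition of 72 points per inch (the helper name `text_height_px` is only illustrative), the nominal height of a font in a scan is its point size times the scan resolution, divided by 72:

```python
POINTS_PER_INCH = 72  # a typographic point is 1/72 of an inch


def text_height_px(point_size: float, dpi: int) -> float:
    """Nominal font height in pixels at a given scan resolution."""
    return point_size * dpi / POINTS_PER_INCH


# Compare the same 10-point text at three common scan resolutions.
for dpi in (150, 300, 600):
    print(f"10 pt text at {dpi} dpi is about {text_height_px(10, dpi):.1f} px tall")
```

At 150 dpi, 10-point text is only about 21 pixels tall, which leaves very few pixels for each stroke of a character; at 300 dpi the same text is about 42 pixels tall.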
Tips for accurate OCR output
Despite these limitations, a few best practices and workarounds help keep the final output consistent with the scanned original:
- - Set the scanner color settings to Grayscale, or Black and White if the text is black against a white background.
- - If supported by your scanner, adjust brightness and contrast to achieve deep blacks and bright whites.
- - Set the scan quality (resolution) to 300 dpi or higher.
- - Start with a good original document. Wrinkles and creases might hinder OCR accuracy.
- - Ensure scanner glass is clean and free from smudges.
- - Keep your pages as straight as possible during scanning. Skewed pages require more processing in the OCR engine.
- - Depending on the quality of your scanner, you might need to scan the same document several times to obtain the best image.
- - If your text is on a patterned or colored background, try to obtain another version on a plain white background. Text against colored backgrounds or gradients will require several attempts with different settings until the right configuration for successful OCR is found.
- - Some smudges can be manually repaired by using white correction fluid to cover unwanted artifacts.
- - If supported by your scanner, enable the despeckle setting to remove noise from your image.
- - If supported by your scanner, increase text smoothing to remove harsh blends and grain.
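When the scanner does not offer these settings, similar clean-up can be approximated in software before running OCR. The following is a minimal sketch, not the method of any particular scanner or OCR engine: it assumes a grayscale image held as a list of rows of 0-255 integers, and the function names (`stretch_contrast`, `binarize`, `despeckle`) and the threshold of 128 are illustrative choices.

```python
from statistics import median_low

Image = list[list[int]]  # rows of grayscale pixels: 0 = black, 255 = white


def stretch_contrast(img: Image) -> Image:
    """Linearly rescale so the darkest pixel becomes 0 and the brightest 255."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:  # flat image: nothing to stretch
        return [row[:] for row in img]
    return [[(p - lo) * 255 // (hi - lo) for p in row] for row in img]


def binarize(img: Image, threshold: int = 128) -> Image:
    """Map every pixel to pure black (0) or pure white (255)."""
    return [[0 if p < threshold else 255 for p in row] for row in img]


def despeckle(img: Image) -> Image:
    """Replace each pixel with the median of its 3x3 neighbourhood.

    Edge pixels use only the neighbours that exist.
    """
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                img[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
            ]
            row.append(median_low(window))
        out.append(row)
    return out


# A faded 5x5 patch: light-grey paper (200) with one dark speck (90).
page = [[200] * 5 for _ in range(5)]
page[2][2] = 90
clean = despeckle(binarize(stretch_contrast(page)))
print(clean)
```

On this patch the isolated speck is removed and the paper becomes pure white. A 3x3 median can also erode very thin strokes, so this kind of filter is better suited to scans at 300 dpi or higher than to low-resolution images.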

