Tips for Improving OCR Results

Tesseract is a library for optical character recognition (OCR). It performs best when given a preprocessed image, ideally crisp black text on a pure white background.

The Tesseract wiki describes several preprocessing steps that improve recognition results: https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality

The key points for optimal recognition are:

Use a suitably scaled image (> 300 dpi)
Use a black-and-white (binarized) image
Remove as much noise as possible
Use a straight image (not skewed)
Remove any borders
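
Three of the points above (scaling, black and white, borders) can be sketched in plain Python. This is only an illustration, not Tesseract's own preprocessing: it assumes the image is already loaded as a list of rows of 8-bit grayscale values, the function names are hypothetical, and Otsu's method is simply one common way to choose a black/white threshold. Noise removal and deskewing are left out, and a real pipeline would normally use an imaging library rather than nested lists.

```python
from typing import List

# Assumption: a grayscale image is a list of rows, each pixel 0 (black) to 255 (white).
Image = List[List[int]]


def scale_factor_for_dpi(current_dpi: float, target_dpi: float = 300.0) -> float:
    """Return the factor to upscale by so the image reaches target_dpi."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    # Never downscale: an image that already meets the target is left alone.
    return max(1.0, target_dpi / current_dpi)


def otsu_threshold(img: Image) -> int:
    """Choose the threshold that maximizes between-class variance (Otsu's method)."""
    hist = [0] * 256
    for row in img:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no pixels at or below t yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is at or below t
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(img: Image) -> Image:
    """Map pixels at or below the Otsu threshold to black, the rest to white."""
    t = otsu_threshold(img)
    return [[0 if p <= t else 255 for p in row] for row in img]


def crop_dark_border(img: Image) -> Image:
    """Strip edge rows and columns that are entirely black (e.g. a scan border)."""
    def all_black(pixels) -> bool:
        return all(p == 0 for p in pixels)

    top, bottom = 0, len(img)
    while top < bottom and all_black(img[top]):
        top += 1
    while bottom > top and all_black(img[bottom - 1]):
        bottom -= 1
    rows = img[top:bottom]
    if not rows:
        return []
    left, right = 0, len(rows[0])
    while left < right and all_black(r[left] for r in rows):
        left += 1
    while right > left and all_black(r[right - 1] for r in rows):
        right -= 1
    return [r[left:right] for r in rows]


if __name__ == "__main__":
    # A tiny 5x5 "scan": dark border (20), light paper (230), one dark text pixel (40).
    scan = [
        [20, 20, 20, 20, 20],
        [20, 230, 230, 230, 20],
        [20, 230, 40, 230, 20],
        [20, 230, 230, 230, 20],
        [20, 20, 20, 20, 20],
    ]
    print(scale_factor_for_dpi(150))         # a 150 dpi scan needs 2x upscaling
    print(crop_dark_border(binarize(scan)))  # 3x3 page area, border removed
```

Binarizing before cropping keeps the border test simple, since a dark scan edge becomes uniformly black once the image is thresholded.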