Tags: dataset, ocr, tesseract

Free resources containing OCR benchmark test sets


I want to run an OCR benchmark on scanned text (typical scans, e.g. A4 pages). I was able to find some NEOCR datasets here, but NEOCR is not really what I want.

I would appreciate links to free databases that provide suitable images together with the actual text contained in them as ground truth.

I hope this thread will also be useful for other people working on OCR who are searching for datasets, since I didn't find any good reference to such sources.

Thanks!


Solution

  • I've had good luck using university research data sets in a number of projects. These are often useful because the input and expected results must be published so that the study can be independently reproduced. One example is the UNLV data set from the Fourth Annual Test of OCR Accuracy, discussed in more detail below.

    Another approach is to start with a data set and create your own training set. It may also be worthwhile to work with Project Gutenberg, which has transcribed 57,136 books. You could take the HTML version (with images), print it out with a variety of transformations (different fonts, rotation, etc.), scan the printouts back to images, run OCR on them, and compare the output against the plain-text version. See the examples further below.

    1) Annual Tests of OCR Accuracy (DOE and UNLV)

    The Department of Energy (DOE) and the Information Science Research Institute (ISRI) at UNLV ran annual OCR tests for five years, from 1992 to 1996. You can find the study descriptions for each year here:

    1.1) UNLV Tesseract OCR Test Data published in Fourth Annual Test of OCR Accuracy

    The data from the fourth annual test, which is also used to test Tesseract, is posted online. Since it comes from an OCR study, it may suit your purposes.

    This data is now hosted as part of the ISRI of UNLV OCR Evaluation Tools project posted on Google Code:

    • Images and ground-truth text and zone files for several thousand English and some Spanish pages that were used in the UNLV/ISRI annual tests of OCR accuracy between 1992 and 1996.

    • Source code of the OCR evaluation tools used in the UNLV/ISRI annual tests of OCR accuracy.

    • Publications of the Information Science Research Institute of UNLV applicable to OCR and text retrieval.

    You can find information on this data set here:

    At the datasets link, you'll find a number of gzipped tarballs you can download. Each tarball contains a number of directories, each holding a set of files. Each document consists of 3 files (paired up in the sketch after this list):

    • .tif — binary image file
    • .txt — ground-truth text file
    • .uzn — zone file describing the layout of the scanned image
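    Here's a minimal Python sketch of how you might pair those files up once a tarball has been extracted. The unlv_data directory name and the exact .uzn column layout (left, top, width, height, type) are assumptions on my part, so check them against the actual files:

        from pathlib import Path

        def iter_documents(root):
            # Yield (tif, txt, uzn) path triples for every document under root.
            for tif in Path(root).rglob("*.tif"):
                txt = tif.with_suffix(".txt")
                uzn = tif.with_suffix(".uzn")
                if txt.exists() and uzn.exists():
                    yield tif, txt, uzn

        def parse_uzn(path):
            # Parse a zone file; each line is assumed to hold
            # "left top width height type" separated by whitespace.
            zones = []
            for line in path.read_text().splitlines():
                parts = line.split()
                if len(parts) >= 5:
                    left, top, width, height = map(int, parts[:4])
                    zones.append((left, top, width, height, parts[4]))
            return zones

        for tif, txt, uzn in iter_documents("unlv_data"):
            print(tif.name, "has", len(parse_uzn(uzn)), "zones")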

    Note: while posting, I noticed this data set was originally posted in a comment by @Stef above.

    2) Project Gutenberg

    Project Gutenberg has transcribed 57,136 free ebooks, available in the following formats:

    • HTML
    • EPUB (with images)
    • EPUB (no images)
    • Kindle (with images)
    • Kindle (no images)
    • Plain Text UTF-8

    Here is an example: http://www.gutenberg.org/ebooks/766
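    If you want to fetch the plain-text version programmatically, something like the following should work; the .txt.utf-8 URL suffix is an assumption based on Project Gutenberg's usual link pattern, so verify it against the book's download page:

        import urllib.request

        # Assumed URL pattern for the UTF-8 plain-text version of ebook 766.
        url = "http://www.gutenberg.org/ebooks/766.txt.utf-8"
        with urllib.request.urlopen(url) as resp:
            text = resp.read().decode("utf-8")
        print(text[:200])  # first 200 characters of the book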

    You could create a test data set by doing the following:

    Create test files:

    1. Start with HTML, ePub, Kindle, or plain text versions
    2. Render and transform using different fonts, rotation, background color, with and without images, etc.
    3. Convert the rendering to the desired format, e.g. TIFF, PDF, etc.
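    Here is a rough Pillow sketch of steps 2 and 3: render a snippet of the plain text with a chosen font and rotation, then save it as a TIFF. The font path and output file name are assumptions; point them at whatever is available on your system:

        from PIL import Image, ImageDraw, ImageFont

        def render_page(text, font_path="DejaVuSans.ttf", size=24, angle=0):
            # Render text onto a white A4-sized grayscale page (~300 dpi).
            font = ImageFont.truetype(font_path, size)
            page = Image.new("L", (2480, 3508), color=255)
            draw = ImageDraw.Draw(page)
            draw.multiline_text((100, 100), text, font=font, fill=0)
            if angle:
                # Simulate a skewed scan; fill the exposed corners with white.
                page = page.rotate(angle, expand=True, fillcolor=255)
            return page

        page = render_page("It was the best of times...", angle=2)
        page.save("page_001.tif", dpi=(300, 300))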

    Test:

    1. Run generated images through OCR system
    2. Compare with original plain text version
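    And a minimal sketch of the test loop using pytesseract. The file names are the hypothetical ones from the rendering sketch above, and the accuracy measure is a simple edit-similarity from the standard library, not the official ISRI accuracy metric:

        import difflib

        import pytesseract
        from PIL import Image

        def char_accuracy(truth, ocr):
            # Similarity in [0, 1] between ground truth and OCR output.
            return difflib.SequenceMatcher(None, truth, ocr).ratio()

        truth = open("page_001.txt", encoding="utf-8").read()
        ocr = pytesseract.image_to_string(Image.open("page_001.tif"))
        print(f"character accuracy: {char_accuracy(truth, ocr):.3f}")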