I want to run an OCR benchmark on scanned text (typical document scans, e.g. A4 pages). I was able to find some NEOCR datasets here, but NEOCR is not really what I want.
I would appreciate links to sources of free databases that have appropriate images and the actual texts (contained in the images) referenced.
I hope this thread will also be useful for other people searching for OCR datasets, since I didn't find any good reference to such sources.
Thanks!
I've had good luck using university research data sets in a number of projects. These are often useful because the input and expected results need to be published to independently reproduce the study results. One example is the UNLV data set for the Fourth Annual Test of OCR Accuracy discussed more below.
Another approach is to start with an existing text corpus and create your own test set. It may be worthwhile to work with Project Gutenberg, which has transcribed 57,136 books. You could take the HTML version (with images) and print it out using a variety of transformations like fonts, rotation, etc. Then you could scan the printouts and compare the OCR output against the original text version. See an example further below.
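For the comparison step, a simple character-level accuracy score can be computed with nothing but the standard library. This is only a minimal sketch of the "compare against the text version" idea (the OCR step itself, e.g. running Tesseract on the scanned image, is left out); the sample strings are made up:

```python
import difflib

def char_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Character-level accuracy: matching characters / ground-truth length."""
    matcher = difflib.SequenceMatcher(None, ground_truth, ocr_output)
    matches = sum(block.size for block in matcher.get_matching_blocks())
    return matches / max(len(ground_truth), 1)

# Toy example: "w" misread as "vv", "l" misread as "1".
truth = "The quick brown fox jumps over the lazy dog."
ocr   = "The quick brovvn fox jumps over the 1azy dog."
print(round(char_accuracy(truth, ocr), 3))  # → 0.955
```

In a real benchmark you would feed the transcription from Project Gutenberg as `ground_truth` and the recognizer's output as `ocr_output`, averaging the score over all pages.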
1) Annual Tests of OCR Accuracy DOE and UNLV
The Department of Energy (DOE) and the Information Science Research Institute (ISRI) of UNLV ran annual OCR accuracy tests from 1992 to 1996. You can find the study descriptions for each year here:
1.1) UNLV Tesseract OCR Test Data published in Fourth Annual Test of OCR Accuracy
The data for the fourth annual test, later used to test Tesseract, is posted online. Since this was an OCR accuracy study, it may suit your purposes.
This data is now hosted as part of the ISRI of UNLV OCR Evaluation Tools project posted on Google Code:
Images and Ground Truth text and zone files for several thousand English and some Spanish pages that were used in the UNLV/ISRI annual tests of OCR accuracy between 1992 and 1996.
Source code of OCR evaluation tools used in the UNLV/ISRI annual tests of OCR Accuracy.
Publications of the Information Science Research Institute of UNLV applicable to OCR and text retrieval.
You can find information on this data set here:
At the datasets link, you'll find a number of gzipped tarballs you can download. Each tarball contains a number of directories with a set of files. Each document has 3 files:

- .tif: the binary image file
- .txt: the ground-truth text file
- .uzn: the zone file describing the layout of the scanned image

Note: while posting, I noticed this data set was originally posted in a comment by @Stef above.
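Once unpacked, the documents can be collected by grouping files that share a stem and keeping only complete .tif/.txt/.uzn triples. A minimal sketch, assuming that flat per-document layout (the file names below are made up; here a mock directory stands in for an unpacked tarball):

```python
import tempfile
from collections import defaultdict
from pathlib import Path

# Mock of an unpacked UNLV tarball: .tif/.txt/.uzn triples per document.
root = Path(tempfile.mkdtemp())
for stem in ("0001", "0002"):
    for ext in (".tif", ".txt", ".uzn"):
        (root / f"{stem}{ext}").touch()

# Group files by document stem.
docs = defaultdict(dict)
for path in root.iterdir():
    docs[path.stem][path.suffix] = path

# Keep only documents that have all three files.
complete = {stem: files for stem, files in docs.items()
            if {".tif", ".txt", ".uzn"} <= files.keys()}
print(sorted(complete))  # → ['0001', '0002']
```

Each surviving entry then gives you the image to feed the OCR engine, the ground-truth text to score against, and the zone file for layout-aware evaluation.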
2) Project Gutenberg
Project Gutenberg has transcribed 57,136 free ebooks in the following formats:
Here is an example: http://www.gutenberg.org/ebooks/766
You could create a test data set by doing the following:
Create test files:
Test:
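For the "create test files" step, the ground-truth text can be extracted from the Gutenberg HTML version using only the standard library. A minimal sketch; the sample HTML snippet below is made up, not taken from the actual ebook:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Strip tags from a Gutenberg HTML page to get ground-truth text."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script>/<style>, whose text we drop

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

# Made-up stand-in for a downloaded Gutenberg HTML page.
html = "<html><body><h1>David Copperfield</h1><p>Chapter I.</p></body></html>"
parser = TextExtractor()
parser.feed(html)
ground_truth = " ".join(" ".join(parser.parts).split())
print(ground_truth)  # → David Copperfield Chapter I.
```

The printed/scanned renditions of the same page then go through the OCR engine, and the output is scored against this extracted text.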