I've got a scanned document and I would like to use Tesseract to get the text from it.
Here is an example of my PDF quality:
As you can see, in "maintenance" there is a little dot above the "c". Tesseract recognizes this word as "mafintenanée" with each of the following commands:
tesseract 1.pdf final -l eng --oem 2
tesseract 1.pdf final -l eng --oem 1
tesseract 1.pdf final -l eng
I can't afford this kind of misrecognition, so I've tried to improve my PDF with ImageMagick.
I've tried all the following commands:
convert 1.pdf -resize 400% outResize400.tif
convert 1.pdf -quality 100 out.tif
convert 1.pdf -quality 100 outquality100.tif
convert 1.pdf -background white backgroundwhite.tif
convert 1.pdf -density 200x200 density200x200.tif
convert 1.pdf -density 200x200 density200.jpg
convert 1.pdf -antialias antialias.tif
convert 1.pdf -background white -density 800 backgroundwhitewithdensity800.tif
convert 1.pdf -density 400% density400percent.tif
One of the best results I get is this:
As you can see, the text is totally destroyed by ImageMagick.
Do you have any idea of the settings I should use to improve my results?
As requested by Vico:
You typically need to specify -density XXX before reading a vector file such as a PDF, that is, before the input filename on the command line. For example:
convert -density 288 1.pdf -resize 25% 1.tiff
The nominal density is 72 dpi, so 288 = 4 × 72 and 25% = 1/4: the PDF is rasterized at four times the nominal resolution and then resized back to its nominal size. If you want larger characters, either increase the density or remove the -resize. If the scans are not clean, we would need to see the actual PDF to suggest further processing, which might depend on the density assigned.
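The density/resize arithmetic can be sketched as a small shell helper. This is only an illustration: the function names resize_percent and supersample_cmd are hypothetical, and the script just prints the convert command for a given density rather than running it. It assumes the 72 dpi nominal density mentioned above; integer division means densities that are not a clean multiple of 72 give a rounded percentage.

```shell
#!/bin/sh
# Hypothetical helper: print the -resize percentage that brings an image
# rasterized at the given density back to its nominal 72 dpi size.
resize_percent() {
    density=$1
    # 72 dpi is the nominal density; integer division rounds down.
    echo $(( 72 * 100 / density ))
}

# Hypothetical helper: print (not run) the convert command, with
# -density placed before the input file so it applies when the PDF is read.
# Arguments: density, input PDF, output image.
supersample_cmd() {
    printf 'convert -density %s %s -resize %s%% %s\n' \
        "$1" "$2" "$(resize_percent "$1")" "$3"
}

# Prints: convert -density 288 1.pdf -resize 25% 1.tiff
supersample_cmd 288 1.pdf 1.tiff
```

The resulting image could then be passed to Tesseract in the same way as in the question, for example tesseract 1.tiff final -l eng.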