
Convert scanned pdf to .txt files using tesseract


I have to convert a .pdf file containing scanned images into .txt files. Tesseract OCR only converts images to .txt, so I first need to extract the pages as .tif images and then convert those. Can anyone help me with this?


Solution

  • Use ImageMagick:

    convert -density 600 input.pdf output.tif
    

    The -density value is in DPI; in my experience, 600 DPI works best.
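
    To cover the second half of the question, the rendered TIFF can then be handed to Tesseract. The following is only a rough sketch: pdf_to_txt is a made-up helper name, and it assumes that ImageMagick can read PDFs on your system (it needs Ghostscript for that) and that your Tesseract build accepts multi-page TIFF input. On ImageMagick 7 the command is usually magick rather than convert.

```shell
# Sketch: rasterize a scanned PDF to a TIFF, then OCR it with Tesseract.
# pdf_to_txt is a hypothetical helper name, not part of either tool.
pdf_to_txt() {
    pdf="$1"              # e.g. input.pdf
    base="${pdf%.pdf}"    # strip the extension, e.g. "input"

    # Render all pages at 600 DPI into one multi-page TIFF.
    convert -density 600 "$pdf" "$base.tif" || return 1

    # Tesseract takes an output base name and appends ".txt" itself.
    tesseract "$base.tif" "$base"
}
```

    Calling pdf_to_txt input.pdf should leave input.tif and input.txt next to the PDF. If 600 DPI turns out to be too slow or memory-hungry for large documents, lowering the density is the first thing to try.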