Tags: hadoop, pdf, tesseract, pdfbox

Need to implement bulk PDF extraction using Tesseract API


I have a large number of PDF documents from which I need to extract text; the extracted text is then used for further processing. I did this for a small subset of documents using the Tesseract API in a linear, one-by-one approach and got the required output. However, this takes a very long time when the number of documents is large.

I tried to use the processing capability (MapReduce) and storage (HDFS) of the Hadoop environment to solve this issue. However, I am facing problems fitting the Tesseract API into the Hadoop (MapReduce) approach. Since Tesseract converts the files into intermediate image files, I am confused about how the intermediate image files produced by the Tesseract process can be handled inside HDFS.

I have searched and unsuccessfully tried a few options, such as:

  1. I extracted text from PDFs by extending the FileInputFormat class into my own PdfInputFormat class in Hadoop MapReduce, using Apache PDFBox to extract the text from each PDF (a sketch of this idea follows this list). However, for scanned PDFs, which only contain images, this solution does not give me the required results.

  2. I found a few answers on the same topic stating that one should use -Fuse, or that one should generate the image files locally and then upload them into HDFS for further processing. I am not sure whether this is the correct approach.
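For reference, here is a minimal sketch of what the PdfInputFormat idea from point 1 could look like. It assumes Apache PDFBox 2.x and the Hadoop 2.x mapreduce API; the PdfRecordReader name and the exact structure are illustrative, not the precise code I used:

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // a PDF must be read whole, never split by byte range
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new PdfRecordReader();
    }

    /** Emits exactly one record per PDF: (0, full extracted text). */
    public static class PdfRecordReader extends RecordReader<LongWritable, Text> {
        private boolean processed = false;
        private final LongWritable key = new LongWritable(0);
        private final Text value = new Text();
        private FileSplit fileSplit;
        private TaskAttemptContext context;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext ctx) {
            this.fileSplit = (FileSplit) split;
            this.context = ctx;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            Path path = fileSplit.getPath();
            FileSystem fs = path.getFileSystem(context.getConfiguration());
            try (FSDataInputStream in = fs.open(path);
                 PDDocument doc = PDDocument.load(in)) {
                // PDFTextStripper only reads the embedded text layer, which is
                // why scanned (image-only) PDFs come back essentially empty.
                value.set(new PDFTextStripper().getText(doc));
            }
            processed = true;
            return true;
        }

        @Override public LongWritable getCurrentKey() { return key; }
        @Override public Text getCurrentValue() { return value; }
        @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
        @Override public void close() { }
    }
}
```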

I would like to know about approaches to work around this.


Solution

  • This is an approach I found to process multiple PDFs and extract their text using the power of the Hadoop framework, and then use that text for further processing:

    1. Put all the PDFs to be converted to text in one folder.
    2. Create one text file per PDF, containing the path to that PDF. E.g., if I have 10 PDFs to convert, then 10 text files are generated, each containing the unique path to the respective PDF.
    3. These text files are given as input to the MapReduce program (see the driver sketch after this list).
    4. Because each input file is very small, the framework generates only one input split per input file. E.g., if I have 10 PDFs as input, the framework will generate 10 input splits.
    5. From each input split, one line (record) is read by the record reader and passed to one mapper as a value. So if there are 10 records (one file path per input text file), the mapper runs 10 times. As there is one record per input split, one mapper is assigned the task for that input split.
    6. As there are 10 input splits, 10 mappers run in parallel.
    7. Inside the mapper, Ghostscript generates images from the PDF whose path arrives in the mapper's value. The images are converted to text with Tesseract inside the mapper itself, giving the text of each PDF; this is the mapper output (see the mapper sketch at the end of the answer).
    8. This output is passed to the reducer to do other analytics work as required.
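
    A rough driver sketch of steps 2-6, assuming Hadoop 2.x. The class and path names (PdfOcrDriver, PdfOcrMapper, /pdfs, /pdf-pointers, /ocr-output) are placeholders, and PdfOcrMapper itself is sketched at the end of the answer; because every pointer file is far smaller than an HDFS block, the framework creates one split, and therefore one mapper, per PDF:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PdfOcrDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path pdfDir = new Path("/pdfs");              // step 1: folder of PDFs
        Path pointerDir = new Path("/pdf-pointers");  // step 2: one path-file per PDF
        Path outputDir = new Path("/ocr-output");

        // Step 2: write one tiny text file per PDF, containing only its path.
        for (FileStatus status : fs.listStatus(pdfDir)) {
            Path pdf = status.getPath();
            Path pointer = new Path(pointerDir, pdf.getName() + ".txt");
            try (FSDataOutputStream out = fs.create(pointer, true)) {
                out.writeBytes(pdf.toString() + "\n");
            }
        }

        // Steps 3-6: the default TextInputFormat gives one split per pointer
        // file (all far below the block size), hence one mapper per PDF.
        Job job = Job.getInstance(conf, "bulk pdf ocr");
        job.setJarByClass(PdfOcrDriver.class);
        job.setMapperClass(PdfOcrMapper.class); // sketched below
        // job.setReducerClass(...) would hold the analytics of step 8.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, pointerDir);
        FileOutputFormat.setOutputPath(job, outputDir);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```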

    This is the current solution; I would like feedback on it. A sketch of the mapper described in step 7 follows.
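
    This is a rough, untested sketch of the mapper, assuming the gs (Ghostscript) and tesseract command-line tools are installed on every worker node; the class name matches the placeholder used in the driver above:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PdfOcrMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Path pdfOnHdfs = new Path(value.toString().trim());
        FileSystem fs = pdfOnHdfs.getFileSystem(context.getConfiguration());

        // Work in a local scratch directory, so the intermediate images
        // never have to live on HDFS.
        File workDir = Files.createTempDirectory("ocr").toFile();
        File localPdf = new File(workDir, pdfOnHdfs.getName());
        fs.copyToLocalFile(pdfOnHdfs, new Path(localPdf.getAbsolutePath()));

        // Step 7a: Ghostscript renders each page to a 300 dpi PNG.
        run(new String[] {"gs", "-dNOPAUSE", "-dBATCH", "-sDEVICE=png16m", "-r300",
                "-sOutputFile=" + new File(workDir, "page-%03d.png").getAbsolutePath(),
                localPdf.getAbsolutePath()}, workDir);

        // Step 7b: Tesseract OCRs each rendered page; page-NNN.txt is produced.
        StringBuilder text = new StringBuilder();
        File[] pages = workDir.listFiles((dir, name) -> name.endsWith(".png"));
        Arrays.sort(pages);
        for (File page : pages) {
            String base = page.getAbsolutePath().replaceAll("\\.png$", "");
            run(new String[] {"tesseract", page.getAbsolutePath(), base}, workDir);
            text.append(new String(Files.readAllBytes(Paths.get(base + ".txt"))));
        }

        // Step 8: the reducer receives (pdf name, extracted text).
        context.write(new Text(pdfOnHdfs.getName()), new Text(text.toString()));
    }

    /** Runs an external command and waits for it, failing the task on error. */
    private static void run(String[] cmd, File workDir)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).directory(workDir).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IOException("Command failed: " + String.join(" ", cmd));
        }
    }
}
```

    Keeping the rendered images in the task's local scratch directory sidesteps the original concern about storing Tesseract's intermediate image files in HDFS; only the final extracted text flows on to the reducer.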