I have a large number of PDF documents from which I need to extract text for further processing. I did this for a small subset of the documents using the Tesseract API, processing one document at a time, and I get the required output. However, this takes a very long time when I have a large number of documents.
I tried to use Hadoop's processing (MapReduce) and storage (HDFS) capabilities to solve this issue. However, I am having trouble integrating the Tesseract API into the MapReduce approach. Because the Tesseract process converts the files into intermediate image files, I am confused about how those intermediate image files can be handled inside HDFS.
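On the intermediate images specifically: one possibility is that they never need to reach HDFS at all, because a map task can write them to a temporary directory on its local disk, run OCR there, and emit only the resulting text. The sketch below illustrates that pattern with the JDK alone. `PageRenderer`, `OcrEngine`, and `LocalScratchOcr` are hypothetical names introduced for illustration; they stand in for a real PDF rasterizer and a real Tesseract wrapper and are not part of Hadoop, PDFBox, or Tesseract.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical stand-in for a PDF rasterizer that writes one image file per page. */
interface PageRenderer {
    List<Path> renderPages(byte[] pdfBytes, Path outputDir) throws IOException;
}

/** Hypothetical stand-in for an OCR engine such as a Tesseract wrapper. */
interface OcrEngine {
    String recognize(Path imageFile) throws IOException;
}

final class LocalScratchOcr {
    private final PageRenderer renderer;
    private final OcrEngine ocr;

    public LocalScratchOcr(PageRenderer renderer, OcrEngine ocr) {
        this.renderer = renderer;
        this.ocr = ocr;
    }

    /** Returns the OCR text of one PDF; page images exist only in a local temp directory. */
    public String extractText(byte[] pdfBytes) throws IOException {
        Path scratch = Files.createTempDirectory("ocr-scratch-");
        try {
            StringBuilder text = new StringBuilder();
            for (Path page : renderer.renderPages(pdfBytes, scratch)) {
                text.append(ocr.recognize(page)).append('\n');
            }
            return text.toString();
        } finally {
            // Remove the intermediate images whether or not OCR succeeded.
            deleteRecursively(scratch);
        }
    }

    private static void deleteRecursively(Path dir) throws IOException {
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order deletes files before the directory that contains them.
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    public static void main(String[] args) throws IOException {
        // Fake renderer: writes the input bytes as a single "page" file.
        PageRenderer fakeRenderer = (pdf, dir) -> {
            Path page = dir.resolve("page-1.png");
            Files.write(page, pdf);
            return Collections.singletonList(page);
        };
        // Fake OCR: returns the file content as text.
        OcrEngine fakeOcr = image ->
                new String(Files.readAllBytes(image), StandardCharsets.UTF_8);

        byte[] input = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.print(new LocalScratchOcr(fakeRenderer, fakeOcr).extractText(input));
    }
}
```

In a real job, the two fake lambdas would be replaced by actual rendering and OCR calls, and the mapper would emit the returned string keyed by the document name. Whether local scratch space is large enough for the rendered pages depends on the cluster configuration, so this is a starting point rather than a tested design.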
I have searched and unsuccessfully tried a few options, such as:
I extracted text from PDFs with Hadoop MapReduce by extending the FileInputFormat class into my own PdfInputFormat class, using Apache PDFBox to extract the text. However, for scanned PDFs that contain images, this solution does not give me the required results.
I found a few answers on the same topic suggesting either that FUSE will help, or that one should generate the image files locally and then upload them into HDFS for further processing. I am not sure whether this is the correct approach.
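For the PDFBox option, one possible refinement is to keep the text-layer extraction as the first attempt and fall back to OCR only when it returns almost nothing, which is typically what happens with scanned PDFs. The sketch below shows that decision logic only. `HybridTextExtractor`, the two extractor functions, and the `minChars` threshold are assumptions made for illustration; the functions stand in for a PDFBox-based extractor and a Tesseract-based one, and the threshold is a heuristic that would need tuning.

```java
import java.util.function.Function;

final class HybridTextExtractor {
    // Hypothetical: extracts the embedded text layer (e.g. via PDFBox).
    private final Function<byte[], String> textLayerExtractor;
    // Hypothetical: renders pages to images and runs OCR (e.g. via Tesseract).
    private final Function<byte[], String> ocrExtractor;
    // Assumed heuristic: fewer non-whitespace characters than this means "scanned".
    private final int minChars;

    public HybridTextExtractor(Function<byte[], String> textLayerExtractor,
                               Function<byte[], String> ocrExtractor,
                               int minChars) {
        this.textLayerExtractor = textLayerExtractor;
        this.ocrExtractor = ocrExtractor;
        this.minChars = minChars;
    }

    /** Tries the cheap text-layer path first and uses OCR only as a fallback. */
    public String extract(byte[] pdfBytes) {
        String text = textLayerExtractor.apply(pdfBytes);
        if (text != null && countNonWhitespace(text) >= minChars) {
            return text;
        }
        return ocrExtractor.apply(pdfBytes);
    }

    private static int countNonWhitespace(String text) {
        return (int) text.chars().filter(c -> !Character.isWhitespace(c)).count();
    }

    public static void main(String[] args) {
        // Fake extractors: the text layer is empty, as for a scanned PDF.
        HybridTextExtractor extractor = new HybridTextExtractor(
                pdf -> "",
                pdf -> "text recovered by OCR",
                10);
        System.out.println(extractor.extract(new byte[0]));
    }
}
```

The point of this design is that the slow OCR path runs only for documents that need it, so a mixed collection of digital and scanned PDFs could go through the same mapper.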
I would like to know what approaches are available for this.
This is the approach I found for processing multiple PDFs to extract text using the power of the Hadoop framework, so that the text can then be used for further processing:
This is my current solution. I would like feedback on it.