hadoop hadoop-streaming hadoop-plugins hadoopy

How to access and manipulate PDF file data in Hadoop?


I want to read PDF files using Hadoop. How can this be done? As far as I know, Hadoop can process only text files, so is there any way to parse PDF files into text?

Any suggestions would be appreciated.


Solution

  • An easy way would be to create a SequenceFile to contain the PDF files. SequenceFile is Hadoop's binary key/value file format, and you could make each record in the SequenceFile one PDF. To do this, create a class that implements Writable and holds the PDF bytes along with any metadata you need. Inside your job you could then use any Java PDF library, such as PDFBox, to manipulate the PDFs, for example to extract their text.
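
As a rough illustration of the record class described above, here is a minimal sketch. The class name PdfRecord, its fields, and the toBytes/fromBytes helpers are hypothetical and chosen only for this example. To keep the sketch runnable without Hadoop on the classpath, it does not actually declare `implements org.apache.hadoop.io.Writable`; it only provides the two methods that interface requires, `write(DataOutput)` and `readFields(DataInput)`, so in a real job you would add the `implements` clause.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical record type: one PDF plus its file name as metadata.
// Assumption: in a real job this class would declare
// "implements org.apache.hadoop.io.Writable"; the write/readFields
// methods below follow that interface's contract.
class PdfRecord {
    private String fileName = "";
    private byte[] content = new byte[0];

    // Hadoop creates Writable instances reflectively, so keep a no-arg constructor.
    PdfRecord() {}

    PdfRecord(String fileName, byte[] content) {
        this.fileName = fileName;
        this.content = content;
    }

    // Serialize the metadata first, then the length-prefixed PDF bytes.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(fileName);
        out.writeInt(content.length);
        out.write(content);
    }

    // Read the fields back in exactly the order they were written.
    public void readFields(DataInput in) throws IOException {
        fileName = in.readUTF();
        content = new byte[in.readInt()];
        in.readFully(content);
    }

    public String getFileName() { return fileName; }

    public byte[] getContent() { return content; }

    // Convenience helper for trying the serialization outside a Hadoop job.
    public byte[] toBytes() {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Inverse of toBytes(): rebuild a record from its serialized form.
    public static PdfRecord fromBytes(byte[] data) {
        try {
            PdfRecord record = new PdfRecord();
            record.readFields(new DataInputStream(new ByteArrayInputStream(data)));
            return record;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a mapper, the bytes returned by getContent() could then be handed to a PDF library. With PDFBox 2.x, for instance, `PDDocument.load(byte[])` followed by `new PDFTextStripper().getText(document)` is one way to get plain text out of each record; newer PDFBox versions load documents through a different entry point, so check the version you are using.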