Tags: pdf, layout, solr, apache-tika, requesthandler

Solr: store text layout from a PDF extracted with Tika / extract request handler


I'm using Solr 4 and the extract request handler to index PDF files, which works well. The text from the PDF is stored in the index in order to display a text snippet with highlighting.

The problem is that the layout of the text is lost in Solr's stored fields. For example, if the PDF content is:

 left text                       right text
 2nd. line leftr text            text at the right side

...the content of the stored field looks like this:

 left text right text
 2nd. line leftr text text at the right side

On the other hand, if I first extract the PDF to text (using the Linux tool pdftotext) and then index the text file (instead of the PDF) with the extract request handler, the stored field keeps the layout. The text snippet (and the content of the stored field in Solr) then looks like this:

 left text                       right text
 2nd. line leftr text            text at the right side

My question: is there a way to keep the layout when indexing a PDF directly, not only a text file?


Solution

  • Apache Tika extracts all the text from the PDF, and the contents are indexed as plain text.
    Instead of running the PDF through the extract handler with Tika, you can convert the PDF to text first and index the resulting text file, so that the stored text keeps its layout and is still searchable.
    You can also check whether Tika's default PDF handling, which probably uses PDFBox, can be changed to use another converter that preserves the text layout.
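
The convert-then-index route could be sketched roughly as follows. This is only a minimal sketch under several assumptions: pdftotext is on the PATH and supports the -layout and -enc options, Solr is reachable at http://localhost:8983/solr with the extract handler mapped to /update/extract, and the unique key field is called id. The helper names (build_pdftotext_cmd, build_extract_url, index_pdf_with_layout) are made up for illustration.

```python
import subprocess
import urllib.parse
import urllib.request


def build_pdftotext_cmd(pdf_path, txt_path):
    # -layout asks pdftotext to keep the physical layout as well as it can;
    # -enc UTF-8 fixes the output encoding so it matches the header sent to Solr.
    return ["pdftotext", "-layout", "-enc", "UTF-8", pdf_path, txt_path]


def build_extract_url(solr_base, doc_id):
    # literal.id sets the id field of the document (assumed unique key name);
    # commit=true makes the document visible right after the request.
    query = urllib.parse.urlencode({"literal.id": doc_id, "commit": "true"})
    return solr_base.rstrip("/") + "/update/extract?" + query


def index_pdf_with_layout(pdf_path, doc_id, solr_base="http://localhost:8983/solr"):
    # Step 1: convert the PDF to a layout-preserving text file.
    txt_path = pdf_path + ".txt"
    subprocess.run(build_pdftotext_cmd(pdf_path, txt_path), check=True)

    # Step 2: send the text file (not the PDF) to the extract request handler.
    with open(txt_path, "rb") as fh:
        body = fh.read()
    request = urllib.request.Request(
        build_extract_url(solr_base, doc_id),
        data=body,
        headers={"Content-Type": "text/plain; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

A call such as index_pdf_with_layout("report.pdf", "doc1") would then index the converted text under the id doc1. The stored field should be rendered in a monospaced, whitespace-preserving way on the display side, otherwise the column alignment is lost again when the snippet is shown.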