
Is it possible to extract table information using Apache Tika?


I am looking for a parser for PDF and MS Office document formats that can extract tabular information from files. I was thinking of writing separate implementations when I came across Apache Tika. I am able to extract the full text from any of these file formats, but my requirement is to extract tabular data, where I expect two columns in a key-value format. I checked most of what is available on the net but could not find a solution. Any pointers?


Solution

  • Well, I went ahead and implemented it separately using Apache POI for the MS formats, and came back to Tika for PDF. Tika outputs the parsed document as "SAX based XHTML events".

    So basically we can write a custom SAX implementation to parse the file.

    The structured text output will be of the form (metadata omitted):

    <body><div class="page"><p/>
    <p>Key1 Value1 </p>
    <p>Key2 Value2 </p>
    <p>Key3 Value3</p>
    <p/>
    </div>
    </body>
    

    In our SAX implementation we can treat the first part of each paragraph as the key. (For my problem I already know the keys and am only looking for the values, so the value is simply the substring after the key.)

    Override public void characters(char[] ch, int start, int length) and put this logic there.

    Please note that in my case the structure of the content is fixed and I know which keys to expect, so it was easy to do it this way. This is not a generic solution.
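
The approach described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the original code: the class name KeyValueHandler, the extract helper, and the "key, space, value" paragraph layout are all hypothetical. To keep it self-contained, it feeds the sample XHTML through the JDK's own SAX parser rather than through Tika. Text is buffered per paragraph because a SAX parser may deliver one text node through several characters() calls.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: picks "<key> <value>" paragraphs for a fixed, known set of keys.
class KeyValueHandler extends DefaultHandler {
    private final List<String> knownKeys;
    private final Map<String, String> values = new LinkedHashMap<>();
    private final StringBuilder paragraph = new StringBuilder();
    private boolean inParagraph = false;

    KeyValueHandler(List<String> knownKeys) {
        this.knownKeys = knownKeys;
    }

    // Namespace-aware parsers report the name in localName, others in qName.
    private static boolean isParagraph(String localName, String qName) {
        return "p".equals(localName) || "p".equals(qName);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isParagraph(localName, qName)) {
            inParagraph = true;
            paragraph.setLength(0); // start a fresh buffer for this <p>
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per text node, so only accumulate here.
        if (inParagraph) {
            paragraph.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!isParagraph(localName, qName)) {
            return;
        }
        inParagraph = false;
        String text = paragraph.toString().trim();
        for (String key : knownKeys) {
            // Assumption: the key is followed by a space, then the value.
            if (text.startsWith(key + " ")) {
                values.put(key, text.substring(key.length()).trim());
                break;
            }
        }
    }

    Map<String, String> getValues() {
        return values;
    }

    // Convenience helper for trying the handler on an XHTML string without Tika.
    static Map<String, String> extract(String xhtml, List<String> keys) {
        try {
            KeyValueHandler handler = new KeyValueHandler(keys);
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getValues();
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the sketch short.
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }
}
```

With Tika itself, a handler like this would be passed as the ContentHandler argument of a parser's parse(InputStream, ContentHandler, Metadata, ParseContext) call instead of going through the JDK parser; that wiring is not shown here. As with the original approach, this only works when the keys are known in advance and each key and its value land in the same paragraph.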