
Apache Tika Server: get macros from office documents?


I'm using Apache Tika as a service to analyze Office documents in Python, like so:

url = 'http://{0}:{1}/rmeta/xml'
url = url.format(self._host, self._port)
res = requests.put(url, data=dat).json()

I'd like to extract the content of macros when a document contains them, but I can't figure out how to do it, and the Apache Tika documentation is not much help. Is there a header or some other option I need to set so that Tika Server returns macro content along with the content of the document?


Solution

  • As far as I understand, the problem is that Tika does not extract macros from Office documents by default. To make it do so, I had to create a custom Tika config file that enables the extractMacros property on both Microsoft Office parsers implemented in Tika (I don't know whether they use POI or something else). Here is an example of how to do it: https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/org/apache/tika/parser/microsoft/tika-config-macros.xml
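As a rough illustration of the config approach described above, here is a minimal Python sketch that builds such a config file and picks macro entries out of an /rmeta response. The helper names build_macro_config and find_macros are hypothetical. The parser class names and the "MACRO" resource type are assumptions based on Tika's sources, so check them against the linked example and the Tika version you run.

```python
import xml.etree.ElementTree as ET

# Assumed class names of Tika's two Microsoft Office parsers; verify them
# against the linked example config for your Tika version.
OFFICE_PARSERS = (
    "org.apache.tika.parser.microsoft.OfficeParser",
    "org.apache.tika.parser.microsoft.ooxml.OOXMLParser",
)


def build_macro_config():
    """Return a Tika config XML string with extractMacros set to true."""
    root = ET.Element("properties")
    parsers = ET.SubElement(root, "parsers")
    # Keep the default parser for all other formats, but exclude the two
    # Office parsers so the explicitly configured ones below are used.
    default = ET.SubElement(
        parsers, "parser", {"class": "org.apache.tika.parser.DefaultParser"}
    )
    for cls in OFFICE_PARSERS:
        ET.SubElement(default, "parser-exclude", {"class": cls})
    for cls in OFFICE_PARSERS:
        parser = ET.SubElement(parsers, "parser", {"class": cls})
        params = ET.SubElement(parser, "params")
        param = ET.SubElement(
            params, "param", {"name": "extractMacros", "type": "bool"}
        )
        param.text = "true"
    return ET.tostring(root, encoding="unicode")


def find_macros(rmeta_entries):
    """Return the entries of an /rmeta JSON list that look like macros.

    Assumption: macros are reported as embedded documents whose embedded
    resource type is "MACRO". The metadata key differs between Tika
    versions, so it is matched by suffix rather than by exact name.
    """
    macros = []
    for entry in rmeta_entries:
        for key, value in entry.items():
            if key.endswith("embeddedResourceType") and value == "MACRO":
                macros.append(entry)
                break
    return macros
```

To use it, write the string returned by build_macro_config() to a file such as tika-config-macros.xml and pass that file to the server at startup (for example with the --config option of the tika-server jar, assuming your version supports it). Then call find_macros(res) on the list returned by the request in the question.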