
Tika extra space between letters - is there any way to use setEnableAutoSpace via Web API?


I'm running the stock Apache Tika 1.24.1 Server (tika-server-1.24.1.jar). My ASP.NET MVC web app then gets the parsed documents back from Tika using this VB.net code:

' Send the file to the Tika server and ask for plain text back.
httpWebRequest = HttpWebRequest.Create("http://localhost:9998/tika")
httpWebRequest.Method = "PUT"
httpWebRequest.Accept = "text/plain"
httpWebRequest.UseDefaultCredentials = True
httpWebRequest.GetRequestStream.Write(fileContents, 0, fileContents.Length)
httpWebResponse = httpWebRequest.GetResponse

' Read the extracted text from the response body.
Using contentResponseStream As New StreamReader(httpWebResponse.GetResponseStream)
    tikaTextContents = contentResponseStream.ReadToEnd()
End Using

That part works (the parsed text is returned).

However, when the Tika server parses certain PDF files, it adds extra spaces in some places. I noticed in this Tika ticket that there's a potential solution (setEnableAutoSpace). https://issues.apache.org/jira/browse/TIKA-724

My question: Is there any way to set setEnableAutoSpace from the Tika web interface (or possibly to set it when you parse the file)? Or is the only option to tinker with the Java code if you want to turn this option on?

Thanks!


Solution

  • To set any of the options from PDFParserConfig when making a request to the Tika Server, you need to send an HTTP header whose name is X-Tika-PDF followed by the name of the setting you want to control.

    So, to turn on the enableAutoSpace option when making a request, you should send the header:

    X-Tika-PDFenableAutoSpace: true
    

    If enabling that option only partly fixes your PDF text problem, have a look at the Tika "Troubleshooting PDFs" wiki page for next steps. Depending on the software used to generate them and the options picked, PDFs can be hard...
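
Applied to the VB.NET code in the question, sending that header could look roughly like the sketch below. The module name, the ExtractText function, and the sample.pdf path are made up for illustration. The server URL is the one from the question, and the header name is the one given above. HttpWebRequest.Headers.Add is the standard .NET call for adding a custom request header.

```vbnet
Imports System
Imports System.IO
Imports System.Net

Module TikaAutoSpaceExample
    ' Sends a document to a Tika server and returns the extracted plain text.
    ' The module, function, and parameter names are illustrative only.
    Function ExtractText(tikaUrl As String, fileContents As Byte()) As String
        Dim request As HttpWebRequest = DirectCast(WebRequest.Create(tikaUrl), HttpWebRequest)
        request.Method = "PUT"
        request.Accept = "text/plain"

        ' Header from the answer: the X-Tika-PDF prefix followed by the setting name.
        request.Headers.Add("X-Tika-PDFenableAutoSpace", "true")

        ' Write the document bytes and close the stream before asking for the response.
        Using requestStream As Stream = request.GetRequestStream()
            requestStream.Write(fileContents, 0, fileContents.Length)
        End Using

        ' Read the whole response body as text.
        Using response As WebResponse = request.GetResponse()
            Using reader As New StreamReader(response.GetResponseStream())
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function

    Sub Main()
        ' Hypothetical input file; assumes a Tika server is listening on localhost:9998.
        Dim text As String = ExtractText("http://localhost:9998/tika", File.ReadAllBytes("sample.pdf"))
        Console.WriteLine(text)
    End Sub
End Module
```

Because the header is sent per request, this leaves the stock server jar untouched, and other documents can still be parsed with the default setting.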