I am processing a file using Tika 2.1 from the command line under Ubuntu 20.04 using the following command:
java -jar tika-app-2.1.0.jar -t test.txt
The file is a plain ANSI text file (every byte is in the range 0x00 through 0x7f). As hard as this is to believe, the Tika 2.1 app drops the first lines of the file when a specific string is present in the text file.
Here is the text file:
From:
Sent:
text
this is a test
testing
next
last
And here is the output:
this is a test
testing
next
last
To show that this is a pure ANSI text file with no formatting, no 2-byte Unicode sequences, etc., here is the output of the 'od' command:
0000000 7246 6d6f 0d3a 530a 6e65 3a74 0a0d 6574
0000020 7478 0a0d 0a0d 6874 7369 6920 2073 2061
0000040 6574 7473 0a0d 6574 7473 6e69 2067 0a0d
0000060 0a0d 656e 7478 0a0d 616c 7473
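The dump above can be turned back into bytes to check the claim. This is a minimal sketch, assuming the dump was produced by something like `od -x` on a little-endian machine, so each 16-bit hex word holds two file bytes with the low byte first. The class and method names (`OdDump`, `decode`, `visible`) are made up for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class OdDump {
    // The four dump lines from the question, offsets included.
    static final String DUMP = String.join("\n",
        "0000000 7246 6d6f 0d3a 530a 6e65 3a74 0a0d 6574",
        "0000020 7478 0a0d 0a0d 6874 7369 6920 2073 2061",
        "0000040 6574 7473 0a0d 6574 7473 6e69 2067 0a0d",
        "0000060 0a0d 656e 7478 0a0d 616c 7473");

    // Turn little-endian 16-bit hex words back into the original byte sequence.
    // Assumption: the file has an even number of bytes, so no padding byte was added.
    static byte[] decode(String dump) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String line : dump.split("\n")) {
            String[] fields = line.trim().split("\\s+");
            for (int i = 1; i < fields.length; i++) { // fields[0] is the offset column
                int word = Integer.parseInt(fields[i], 16);
                out.write(word & 0xff);        // low byte comes first in the file
                out.write((word >> 8) & 0xff); // high byte comes second
            }
        }
        return out.toByteArray();
    }

    // Make CR and LF visible so line endings and empty lines can be seen.
    static String visible(byte[] bytes) {
        return new String(bytes, StandardCharsets.US_ASCII)
            .replace("\r", "\\r")
            .replace("\n", "\\n\n");
    }

    public static void main(String[] args) {
        byte[] bytes = decode(DUMP);
        System.out.println(bytes.length + " bytes");
        System.out.println(visible(bytes));
    }
}
```

Under those assumptions the dump decodes to 60 bytes, all below 0x80, with CRLF line endings. It also shows an empty line after "text" and another after "testing", which the file listing in this post does not show.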
However, if I simply change the "Sent:" to "sent:" the output is:
From:
sent:
text
this is a test
testing
next
last
I've been troubleshooting this issue and do not see the connection. If I append "Sent:" to the first line:
From: Sent:
Sent:
text
this is a test
testing
next
last
The results are:
this is a test
testing
next
last
But if I alter "Sent:" to be "\Sent:" on the second line, I get this output:
From: Sent:
\Sent:
text
this is a test
testing
next
last
And this file:
From: Sent:
Sent:
Sent:
text
this is a test
testing
next
last
Results in this output:
this is a test
testing
next
last
But if I place "Sent:" in the first line or a simple (0d 0a) as the first two bytes, the output is fine. Why does the start of the second line seem to matter, and why do uppercase or lowercase variants work but not "Sent:"? Why does preceding "Sent" with a "\" make it work? I've also tried this on different machines, one running Ubuntu 18.04 and one running the jar on a Windows 10 system, both with the same results.
What is going on with the basic Tika response to a very simple text file? I have not altered the jar in any way. This is the jar file as downloaded from the Apache Tika site. What am I missing?
Any information is very much appreciated.
Tika is interpreting the text as an email. This specific example was a text extraction of an email and includes certain keywords (e.g. "From:" and "Sent:") in specific positions. That is why, when other characters are added at the start of the file, Tika falls back to interpreting it as a pure text file.
I had thought that the order of interpretation was first based on the ".txt" extension and then analysis of the content (which in this case does not have any metadata with this text file). But that does not appear to be the case here. It seems that analysis is the first order, before it considers the ".txt" extension.
The example was being run through Tika running as a server. Going forward I will use the Tika API and follow the suggestions provided by the commenter (@Gagravarr): skip the call to AutoDetectParser, set the content type property on the metadata, and call DefaultParser, all via the API.
Thanks to @Gagravarr for finding a solution.
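For anyone wanting to try the same approach, here is a minimal sketch of what that could look like. It assumes tika-core and the standard parsers package (for the plain text parser) are on the classpath. The class name `PlainTextExtract` and the method `extractAsPlainText` are made up for this example, and I have not checked it against every 2.x release.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.DefaultParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class PlainTextExtract {

    // Parse the stream as text/plain without running content detection first.
    static String extractAsPlainText(InputStream in) throws Exception {
        Metadata metadata = new Metadata();
        // State the type up front; DefaultParser picks its delegate parser
        // from this metadata value instead of sniffing the content.
        metadata.set(Metadata.CONTENT_TYPE, "text/plain");

        Parser parser = new DefaultParser();                   // no AutoDetectParser involved
        BodyContentHandler handler = new BodyContentHandler(); // collects the body text
        parser.parse(in, handler, metadata, new ParseContext());
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // Usage: java PlainTextExtract test.txt
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.print(extractAsPlainText(in));
        }
    }
}
```

Because the content type is supplied by the caller, the "From:" and "Sent:" lines should no longer cause the file to be routed to the email parser. The trade-off is that the caller now has to know the type of every file it submits.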