I'm using Tika-server to parse a bunch of .eml files. Extracting both the content and the metadata of the emails and their attachments works fine with the /rmeta endpoint.
The problem is getting the proper attachment file name. When the attachment part in the raw .eml file has the following structure:
Content-Type: application/pdf; name="filename_a.pdf"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="filename_a.pdf"
everything works fine: the extracted file name path in the metadata object (in the API response) is:
"X-TIKA:embedded_resource_path": "/filename_a.pdf"
However, some of my emails have a malformed header structure (the filename is missing from Content-Disposition), e.g.:
Content-Type: application/pdf; name="filename_a.pdf"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
Then, after parsing the whole .eml, I get:
"X-TIKA:embedded_resource_path": "/embedded-1"
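For context, a request of this kind can be sketched roughly as follows. This is only an illustration: the class and method names are made up, and the URL assumes a tika-server instance running locally on its default port 9998.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper for illustration; not part of Tika.
class RmetaRequestSketch {

    // Assumed server location: tika-server's default port on localhost.
    static final String RMETA_URL = "http://localhost:9998/rmeta";

    // Builds a PUT request that carries the raw .eml bytes and asks for JSON.
    static HttpRequest buildRequest(byte[] emlBytes) {
        return HttpRequest.newBuilder(URI.create(RMETA_URL))
                .header("Accept", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(emlBytes))
                .build();
    }
}
```

Sending it with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) should return a JSON array with one metadata object for the email itself and one per attachment; the attachment objects are the ones carrying X-TIKA:embedded_resource_path.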
I checked in Tika's source code that the name used in this path is determined in \org\apache\tika\parser\RecursiveParserWrapper.class here:
private String getResourceName(Metadata metadata, RecursiveParserWrapper.ParserState state) {
    String objectName = "";
    if (metadata.get("resourceName") != null) {
        objectName = metadata.get("resourceName");
    } else if (metadata.get("embeddedRelationshipId") != null) {
        objectName = metadata.get("embeddedRelationshipId");
    } else {
        objectName = "embedded-" + ++state.unknownCount;
    }
    objectName = FilenameUtils.getName(objectName);
    return objectName;
}
I tried to recover the file name by inspecting the Content-Type key in the metadata object, but the name parameter is not there. (I assume Tika determines the Content-Type value by more than just reading the raw header, which is why the name parameter is absent.)
So my question, since I can't figure it out myself: is there a way to modify the Tika source code to force file name extraction from the Content-Type header when the filename attribute in the Content-Disposition header is missing?
OK, so I managed to solve it on my own. The workaround is pretty simple and straightforward.
One has to extend one of the conditions in \org\apache\tika\parser\mail\MailContentHandler.class. At line 129 we have:
if (contentDispositionFileName != null) {
    submd.set("resourceName", contentDispositionFileName);
}
By extending it with an additional else block:
if (contentDispositionFileName != null) {
    submd.set("resourceName", contentDispositionFileName);
} else {
    // Fall back to the "name" parameter of Content-Type (needs import java.util.Map)
    Map<String, String> contentTypeParameters = ((MaximalBodyDescriptor) body).getContentTypeParameters();
    String contentTypeFilename = contentTypeParameters.get("name");
    // Only set the name when the parameter is actually present
    if (contentTypeFilename != null) {
        submd.set("resourceName", contentTypeFilename);
    }
}
we make the handler fall back to the name parameter of the Content-Type header whenever Content-Disposition carries no filename.
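To make the lookup order explicit, here is a minimal standalone sketch of the same fallback, independent of Tika and mime4j. The class and method names are hypothetical and exist only for this illustration; the two arguments stand in for the values the handler gets from the body descriptor.

```java
import java.util.Map;

// Hypothetical helper, not part of Tika: shows the name lookup order in isolation.
class AttachmentNameFallback {

    // Returns the Content-Disposition filename if present, otherwise the
    // "name" parameter of Content-Type, otherwise null (nothing usable).
    static String resolve(String contentDispositionFileName,
                          Map<String, String> contentTypeParameters) {
        if (contentDispositionFileName != null && !contentDispositionFileName.isEmpty()) {
            return contentDispositionFileName;
        }
        if (contentTypeParameters != null) {
            String name = contentTypeParameters.get("name");
            if (name != null && !name.isEmpty()) {
                return name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Malformed part: no filename in Content-Disposition, name in Content-Type.
        System.out.println(resolve(null, Map.of("name", "filename_a.pdf"))); // filename_a.pdf
        // Well-formed part: the Content-Disposition filename wins.
        System.out.println(resolve("b.pdf", Map.of("name", "a.pdf")));       // b.pdf
        // Neither header carries a name.
        System.out.println(resolve(null, Map.of()));                          // null
    }
}
```

When this returns null, resourceName stays unset, so the getResourceName logic in RecursiveParserWrapper quoted in the question falls through to embeddedRelationshipId or the generated embedded-N name as before.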