java, mime-types, apache-tika

Extracting attachment name from eml file using Content-Type header


I'm using Tika server to parse a bunch of eml files. Extracting both the content and the metadata of the emails and their attachments works fine with the /rmeta endpoint.

The problem is getting the proper attachment file name. When the attachment part in the raw eml file has the following structure:

Content-Type: application/pdf; name="filename_a.pdf"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="filename_a.pdf"

everything works fine: the extracted file name path in the metadata object (in the API response) is:

"X-TIKA:embedded_resource_path": "/filename_a.pdf"

However, some of my emails have a malformed header structure (the filename is missing from Content-Disposition), e.g.:

Content-Type: application/pdf; name="filename_a.pdf"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;

Then, after parsing the whole eml, I get:

"X-TIKA:embedded_resource_path": "/embedded-1"

I checked Tika's source code and found that the file name metadata is determined in \org\apache\tika\parser\RecursiveParserWrapper.class here:

    private String getResourceName(Metadata metadata, RecursiveParserWrapper.ParserState state) {
        String objectName = "";
        if (metadata.get("resourceName") != null) {
            objectName = metadata.get("resourceName");
        } else if (metadata.get("embeddedRelationshipId") != null) {
            objectName = metadata.get("embeddedRelationshipId");
        } else {
            objectName = "embedded-" + ++state.unknownCount;
        }

        objectName = FilenameUtils.getName(objectName);
        return objectName;
    }

I tried to access the name attribute by inspecting the Content-Type key in the metadata object, but it is not there. (I assume that Tika determines the Content-Type value not just by reading the header itself, which is why the needed file name is absent.)

So my question, since I can't figure it out myself: is there a way to modify the Tika source code to force file name extraction from the Content-Type header when the filename attribute in the Content-Disposition header is missing?


Solution

  • OK, so I managed on my own. The workaround is pretty simple and straightforward.

    You have to extend one of the conditions in \org\apache\tika\parser\mail\MailContentHandler.class. At line 129 we have:

    if (contentDispositionFileName != null) {
       submd.set("resourceName", contentDispositionFileName);
    }
    

    By extending it with an additional else block:

    if (contentDispositionFileName != null) {
        submd.set("resourceName", contentDispositionFileName);
    } else {
        // Fall back to the "name" parameter of the Content-Type header.
        Map<String, String> contentTypeParameters = ((MaximalBodyDescriptor) body).getContentTypeParameters();
        String contentTypeFilename = contentTypeParameters.get("name");
        // Only set the name when the parameter is actually present.
        if (contentTypeFilename != null) {
            submd.set("resourceName", contentTypeFilename);
        }
    }
    

    we make the handler fall back to the name parameter of the Content-Type header whenever Content-Disposition carries no filename.
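
To check the fallback order outside of Tika, here is a minimal standalone sketch of the same idea. It is not Tika or mime4j code: the class AttachmentNameFallback and the methods headerParam and resolveName are hypothetical names made up for illustration, and the regex-based parsing is a simplification that ignores RFC 2231 encoded parameters (name*=...) and RFC 2047 encoded words. It only shows the intended precedence: Content-Disposition filename first, then Content-Type name, then the "embedded-N" placeholder.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: mirrors the fallback order only.
class AttachmentNameFallback {

    // Extracts one parameter (e.g. "filename" or "name") from a raw header value.
    // Simplified: handles quoted and unquoted values, nothing more.
    static Optional<String> headerParam(String headerValue, String param) {
        if (headerValue == null) {
            return Optional.empty();
        }
        Pattern p = Pattern.compile(
                ";\\s*" + Pattern.quote(param) + "\\s*=\\s*(?:\"([^\"]*)\"|([^;\\s]+))",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(headerValue);
        if (!m.find()) {
            return Optional.empty();
        }
        String value = m.group(1) != null ? m.group(1) : m.group(2);
        return value.isEmpty() ? Optional.empty() : Optional.of(value);
    }

    // Content-Disposition filename wins; Content-Type name is the fallback;
    // otherwise use the same "embedded-N" style placeholder as in the question.
    static String resolveName(String contentDisposition, String contentType, int unknownCount) {
        Optional<String> fromDisposition = headerParam(contentDisposition, "filename");
        if (fromDisposition.isPresent()) {
            return fromDisposition.get();
        }
        Optional<String> fromContentType = headerParam(contentType, "name");
        if (fromContentType.isPresent()) {
            return fromContentType.get();
        }
        return "embedded-" + unknownCount;
    }
}
```

With the malformed headers from the question, resolveName("attachment;", "application/pdf; name=\"filename_a.pdf\"", 1) resolves to the name taken from Content-Type rather than to the placeholder.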