java spring apache-tika

Java/Spring: How to Figure out MimeType on an InputStream Without Consuming It


BASICS

This is a Java 1.8, Spring Boot 1.5 application.

It currently uses Apache Tika 1.22 to detect MIME types, but the library can easily be swapped out.

SUMMARY

There is a controller mapping that users call to download files. The files come from an external URL, separate from the application. A file may be any of a variety of types (Excel, PDF, text, etc.), and the application has no way of knowing which until it pulls the file down.

ISSUE

To return the download to the user with the appropriate file name, extension, and Content-Type, the application uses Apache Tika to detect the media type. Unfortunately, detection consumes the leading bytes of the InputStream, so when the application then writes that same InputStream to the HttpServletResponse, the file is incomplete.
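The truncation described above can be reproduced with JDK classes alone. This is only a sketch under assumptions: a BufferedInputStream stands in for the buffering wrapper that detection reads through, an in-memory byte array stands in for the remote file, and the class and method names (ConsumedHeaderDemo, drain, and so on) are hypothetical. It shows that bytes pulled into the wrapper's buffer are gone from the original stream, even after the wrapper itself is reset.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class ConsumedHeaderDemo {

    // Reads everything still unread in 'in' and returns how many bytes that was.
    static int drain(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.size();
    }

    // Peeks through a buffering wrapper (a stand-in for detection), then
    // drains the ORIGINAL stream, as the controller in the question does.
    static int bytesLeftInOriginalAfterPeek(byte[] data) throws IOException {
        InputStream original = new ByteArrayInputStream(data);
        InputStream wrapper = new BufferedInputStream(original, 16);
        wrapper.mark(8);
        wrapper.read(new byte[8]);
        wrapper.reset(); // rewinds the wrapper only, not 'original'
        return drain(original);
    }

    // Same peek, but the WRAPPER is drained afterwards.
    static int bytesLeftInWrapperAfterPeek(byte[] data) throws IOException {
        InputStream wrapper = new BufferedInputStream(new ByteArrayInputStream(data), 16);
        wrapper.mark(8);
        wrapper.read(new byte[8]);
        wrapper.reset();
        return drain(wrapper);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[100];
        System.out.println("left in original: " + bytesLeftInOriginalAfterPeek(data));
        System.out.println("left in wrapper:  " + bytesLeftInWrapperAfterPeek(data));
    }
}
```

With 100 bytes of input, the original stream yields fewer than 100 bytes after the peek, while the wrapper still yields all of them.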

As a workaround, the application currently closes the first InputStream and then opens a second InputStream to return to the user.

That's not good, because it means the URL is requested twice, wasting system resources.

What is the proper way to do this?

CODE EXAMPLE

    @GetMapping("/My/Download/")
    public void doDownload(HttpServletResponse httpServletResponse) {

        String externalFileURL = "http://www.pdf995.com/samples/pdf.pdf";

        try {
            // First request: fetch the file only to detect its media type.
            MediaType mediaType;
            try (InputStream firstStream = new URL(externalFileURL).openStream()) {
                TikaConfig tikaConfig = new TikaConfig();
                mediaType = tikaConfig.getDetector().detect(TikaInputStream.get(firstStream), new Metadata());
            }

            // Second request: fetch the same file again to send it to the client.
            try (InputStream secondStream = new URL(externalFileURL).openStream()) {
                httpServletResponse.setHeader("Content-Disposition", String.format("attachment; filename=\"%s\"", "DownloadMe." + mediaType.getSubtype()));
                httpServletResponse.setContentType(mediaType.getBaseType().toString());
                FileCopyUtils.copy(secondStream, httpServletResponse.getOutputStream());
                httpServletResponse.flushBuffer();
            }
        } catch (Exception e) {
            // Surface the failure instead of silently swallowing it.
            throw new IllegalStateException("Download failed for " + externalFileURL, e);
        }
    }

Solution

  • Javadoc of detect() says:

    The given stream is guaranteed to support the mark feature and the detector is expected to mark the stream before reading any bytes from it, and to reset the stream before returning.

    Javadoc of TikaInputStream says:

    The created TikaInputStream instance keeps track of the original resource used to create it, while behaving otherwise just like a normal, buffered InputStream. A TikaInputStream instance is also guaranteed to support the mark(int) feature.

    Because detect() resets the stream before returning, the same TikaInputStream can be used for detection and then for reading the full content. Open it once and close it with try-with-resources:

    try (InputStream tikaStream = TikaInputStream.get(new URL(externalFileURL))) {
        // detect() marks the stream, reads the header bytes, and resets it.
        TikaConfig tikaConfig = new TikaConfig();
        MediaType mediaType = tikaConfig.getDetector().detect(tikaStream, new Metadata());

        httpServletResponse.setHeader("Content-Disposition", String.format("attachment; filename=\"%s\"", "DownloadMe." + mediaType.getSubtype()));
        httpServletResponse.setContentType(mediaType.getBaseType().toString());
        // Copy from the same stream: after the reset it starts at the first byte again.
        FileCopyUtils.copy(tikaStream, httpServletResponse.getOutputStream());
        httpServletResponse.flushBuffer();
    }
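The mark/reset contract quoted from the Javadoc can be illustrated without Tika. The following is a minimal sketch, not Tika's implementation: the peek method is a hypothetical stand-in for a detector, a BufferedInputStream stands in for TikaInputStream, and the input is an in-memory byte array rather than a URL. It marks the stream, reads a few header bytes, resets, and then reads the complete content from the same stream.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class MarkResetDemo {

    // Hypothetical stand-in for a detector: mark, read up to 'limit'
    // header bytes, then reset so the caller sees an untouched stream.
    static byte[] peek(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit);
        try {
            byte[] header = new byte[limit];
            int total = 0;
            int n;
            while (total < limit && (n = in.read(header, total, limit - total)) != -1) {
                total += n;
            }
            return Arrays.copyOf(header, total);
        } finally {
            in.reset();
        }
    }

    // Peeks at the header, then reads the whole content from the SAME stream.
    static byte[] peekThenReadAll(byte[] data, int limit) throws IOException {
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(data))) {
            peek(in, limit);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[1024];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }
}
```

Calling MarkResetDemo.peekThenReadAll(data, 4) returns a copy of every byte in data, which is the property the answer relies on when it passes the same stream to detect() and then to FileCopyUtils.copy().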