Tags: java, pdf, zip, filesize, apache-tika

Progress reporting with Apache Tika?


I am using Apache Tika with Java to extract text from PDF and ZIP files. When processing large files, I want to add progress reporting to my application. For that I need an estimate of the total extraction size, so that I can calculate the percentage done by comparing it with the number of bytes written to the output so far.

I've searched a lot and cannot find anything related to this anywhere.

Does Apache Tika provide any kind of progress reporting? If not, is there a workaround?

Edit: I'm using the Apache Tika Java libraries tika-parsers and tika-server from the group org.apache.tika, and invoking them directly from Java with the following code:

AutoDetectParser parser = new AutoDetectParser();
ParseContext context = getParseContext(extractionPolicy, parser);
Metadata metadata = new Metadata();
parser.parse(inputStream, handler, metadata, context);
return metadata;

Solution

  • I was going about it the wrong way. For progress reporting, instead of estimating the extraction size and comparing it with the output bytes, I count the bytes read from the input stream.

    Wrap the input stream in the CountingInputStream class provided by either the AWS SDK or Apache Tika, and compare the bytes read with the total content length to get the percentage.

    CountingInputStream inputStream;   // wraps the stream passed to parser.parse(...)
    long totalContentLength;           // total input size in bytes, known up front

    private int getProgressPercentage() {
        long processedBytes = this.inputStream.getByteCount();
        // Report progress only when the total is known and the count is within range.
        if (0 < totalContentLength && processedBytes <= totalContentLength) {
            int percent = (int) (processedBytes * 100.0 / totalContentLength);
            LOGGER.info("Processed bytes: {}, Total bytes: {}, Progress: {}%", processedBytes, totalContentLength, percent);
            return percent;
        }
        return 0;
    }
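
For anyone who wants to try the idea without pulling in a dependency, here is a minimal, self-contained sketch of the same approach. The names ByteCountingStream and ProgressSketch are hypothetical and written only for this example; ByteCountingStream is a stand-in for a library CountingInputStream, not the Tika or AWS class itself. Unlike the method above, percent() clamps to 100 when the count exceeds the total instead of returning 0, which is a design choice of this sketch. It also assumes the total input size is known before parsing starts.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a library CountingInputStream:
// counts the bytes that pass through read() and skip().
class ByteCountingStream extends FilterInputStream {
    // AtomicLong so a separate progress-reporting thread sees fresh values.
    private final AtomicLong byteCount = new AtomicLong();

    ByteCountingStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            byteCount.incrementAndGet();
        }
        return b;
    }

    // read(byte[]) in FilterInputStream delegates here, so bytes are counted once.
    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            byteCount.addAndGet(n);
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        if (skipped > 0) {
            byteCount.addAndGet(skipped);
        }
        return skipped;
    }

    long getByteCount() {
        return byteCount.get();
    }
}

class ProgressSketch {
    // Returns a value in 0..100; returns 0 when the total size is unknown.
    static int percent(long processedBytes, long totalBytes) {
        if (totalBytes <= 0) {
            return 0;
        }
        long clamped = Math.min(Math.max(processedBytes, 0), totalBytes);
        return (int) (clamped * 100 / totalBytes);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[1000]; // placeholder for the real file content
        ByteCountingStream in = new ByteCountingStream(new ByteArrayInputStream(data));
        byte[] buf = new byte[250];
        // In the real application the parser does the reading, e.g.
        // parser.parse(in, handler, metadata, context), while another
        // thread polls percent(in.getByteCount(), totalLength).
        while (in.read(buf) != -1) {
            System.out.println(percent(in.getByteCount(), data.length) + "%");
        }
    }
}
```

Keep in mind that the count reflects how much of the input the parser has consumed, so it is only an approximation of how far the extraction itself has progressed.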