Search code examples
javaout-of-memoryapache-tika

How to parse large text file with Apache Tika 1.5?


Problem: For my test, I want to extract text data from a 335 MB text file which is wikipedia's "pagecounts-20140701-060000.txt" with Apache Tika.

My solution: I tried to use TikaInputStream since it provides buffering, then I tried to use BufferedInputStream, but that didn't solve my problem. Here is the my test class below:

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class Printer {
    public void readMyFile(String fname) throws IOException, SAXException,
            TikaException {
        System.out.println("Working...");

        File f = new File(fname);
        // InputStream stream = TikaInputStream.get(new File(fname));
        InputStream stream = new BufferedInputStream(new FileInputStream(fname));

        Metadata meta = new Metadata();
        ContentHandler content = new BodyContentHandler(Integer.MAX_VALUE);
        AutoDetectParser parser = new AutoDetectParser();

        String mime = new Tika().detect(f);
        meta.set(Metadata.CONTENT_TYPE, mime);

        System.out.println("trying to parse...");
        try {
            parser.parse(stream, content, meta, new ParseContext());
        } finally {
            stream.close();
        }
    }

    public static void main(String[] args) {
        Printer p = new Printer();
        try {
            p.readMyFile("test/pagecounts-20140701-060000.txt");
        } catch (IOException | SAXException | TikaException e) {
            e.printStackTrace();
        }
    }
}

Problem: Upon invoking the parse method of the parser I am getting:

Working...
trying to parse...
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2367)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:535)
    at java.lang.StringBuffer.append(StringBuffer.java:322)
    at java.io.StringWriter.write(StringWriter.java:94)
    at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:92)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:135)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
    at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
    at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
    at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
    at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:278)
    at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:88)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at com.tastyminerals.cli.Printer.readMyFile(Printer.java:37)
    at com.tastyminerals.cli.Printer.main(Printer.java:46) 

I tried to increase jre memory consumption up to -Xms512M -Xmx1024M, that didn't work and I don't want to use any bigger values.

Questions: What is wrong with my code? How should I modify my class to make it extract text from a test file >300 MB with Apache Tika?


Solution

  • You can set like this to avoid the limit in size :-

    BodyContentHandler bodyHandler = new BodyContentHandler(-1);