
How to read large files (a single continuous string) in Java?


I am trying to read a very large file (~2 GB). Its content is one continuous string of sentences, which I would like to split on '.'. No matter what I try, I end up with an OutOfMemoryError.

    BufferedReader in = new BufferedReader(new FileReader("a.txt"));
    String read = null;
    int i = 0;
    while((read = in.readLine())!=null) {
        String[] splitted = read.split("\\.");
        for (String part: splitted) {
            i+=1;
            users.add(new User(i,part));
            repository.saveAll(users);
        }
    }

I also tried:

    inputStream = new FileInputStream(path);
    sc = new Scanner(inputStream, "UTF-8");
    while (sc.hasNextLine()) {
        String line = sc.nextLine();
        // System.out.println(line);
    }
    // note that Scanner suppresses exceptions
    if (sc.ioException() != null) {
        throw sc.ioException();
    }

Content of the file (composed of random words with a full stop after 10 words):

fmfbqi .xcdqnjqln kvjhw pexrbunnr cgvrqlr fpaczdegnb puqzjdbp gcfxne jawml aaiwwmo ugzoxn .opjc fmfbqi .xcdqnjqln kvjhw pexrbunnr cgvrqlr fpaczdegnb puqzjdbp gcfxne jawml aaiwwmo ugzoxn .opjc  (so on)

Please help!


Solution

  • So first and foremost, based on comments on your question, as Joachim Sauer stated:

    If there are no newlines, then there is only a single line and thus only one line number.

    So your use case is faulty at best: with no newlines, readLine() and nextLine() each try to load the whole file into a single String.

    Let's move past that and assume there may be newline characters or, better yet, that the . character you're splitting on is intended as a pseudo-newline.

    Scanner is not a bad approach here, though there are others. Since you provided a Scanner, let's continue with that, but make sure you wrap it around a BufferedReader. You clearly don't have a lot of memory, and a BufferedReader lets you read the file in chunks while you keep using the Scanner API; the buffering is invisible to you as the caller:

    Scanner sc = new Scanner(new BufferedReader(new FileReader(new File("a.txt")), 10*1024));
    

    This lets the Scanner function as you expect while the BufferedReader reads 10*1024 characters (about 10 KB) at a time, which keeps your memory footprint small. Now set the delimiter and keep calling next():

    sc.useDelimiter("\\.");
    for(int i = 0; sc.hasNext(); i++) {
        String pseudoLine = sc.next();
        //store line 'i' in your database for this pseudo-line
        //DO NOT store pseudoLine anywhere else - you don't have memory for it
    }
    

    Since you don't have enough memory, the point to state (and restate) is: don't keep any part of the file in your JVM's heap space after reading it. Read it, use it, and let it become eligible for garbage collection. In your case, you mention you want to store the pseudo-lines in a database, so read a pseudo-line, store it in the database, and discard it.
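
    As a minimal sketch of that read-use-discard loop (not the asker's actual persistence code): the class name SentenceStreamer, the batchSize parameter, and the sink callback are hypothetical names introduced here. In the question's setup the sink would presumably wrap something like repository.saveAll. Sentences are handed over in small batches and each batch is dropped right after, so heap use is bounded by the batch size rather than by the file size.

    ```java
    import java.io.Reader;
    import java.io.StringReader;
    import java.io.UncheckedIOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Scanner;
    import java.util.function.Consumer;

    public class SentenceStreamer {

        // Reads '.'-delimited sentences from source and hands them to sink in
        // batches of at most batchSize. Returns the number of sentences seen.
        // (Hypothetical helper; sink stands in for a database write.)
        public static long stream(Reader source, int batchSize, Consumer<List<String>> sink) {
            long count = 0;
            List<String> batch = new ArrayList<>(batchSize);
            try (Scanner sc = new Scanner(source)) {
                sc.useDelimiter("\\.");
                while (sc.hasNext()) {
                    String sentence = sc.next().trim();
                    if (sentence.isEmpty()) {
                        continue; // skip blank tokens, e.g. between two adjacent dots
                    }
                    batch.add(sentence);
                    count++;
                    if (batch.size() == batchSize) {
                        sink.accept(batch);
                        // Start a fresh batch; the old one becomes garbage
                        // once the sink no longer references it.
                        batch = new ArrayList<>(batchSize);
                    }
                }
                // Scanner swallows I/O errors, so surface them explicitly.
                if (sc.ioException() != null) {
                    throw new UncheckedIOException(sc.ioException());
                }
            }
            if (!batch.isEmpty()) {
                sink.accept(batch); // flush the final partial batch
            }
            return count;
        }

        public static void main(String[] args) {
            Reader demo = new StringReader("aa bb. cc dd. ee ff.");
            long n = stream(demo, 2, b -> System.out.println("saving " + b));
            System.out.println(n + " sentences");
        }
    }
    ```

    For the real file, the source would be the BufferedReader from above instead of the StringReader used in main.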

    There are other things to point out here, such as configuring your JVM arguments (for example, the maximum heap size via -Xmx), but I hesitate to even mention it, because simply raising the JVM's memory is another brute-force approach. There's nothing wrong with a larger max heap size as such, but if you're still learning to write software, learning memory management is the better investment. It will keep you out of trouble later in professional development.

    Also, I used Scanner and BufferedReader because they appear in your question, but checking out java.nio.file.Files.lines(Path), as pointed out by deHaar, is also a good idea. It does essentially the same thing as the code I've laid out, with the caveat that it always splits on line terminators and gives you no way to change the delimiter. So if your text file consists of one single line, it will still cause you a problem, and you will still need something like a Scanner to break the line into fragments.
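
To illustrate that last point, here is a small sketch, assuming a temporary file created only for the demonstration (the class name LineVsDelimiter and its method names are hypothetical). For a file with no newlines, Files.lines yields the whole content as a single element, whereas a Scanner with '.' as its delimiter yields one token per sentence.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Scanner;
import java.util.stream.Stream;

public class LineVsDelimiter {

    // Number of elements Files.lines produces for the file.
    public static long countLines(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.count();
        }
    }

    // Number of non-blank '.'-delimited tokens a Scanner produces.
    public static long countSentences(Path file) throws IOException {
        long count = 0;
        try (Scanner sc = new Scanner(Files.newBufferedReader(file, StandardCharsets.UTF_8))) {
            sc.useDelimiter("\\.");
            while (sc.hasNext()) {
                if (!sc.next().trim().isEmpty()) {
                    count++;
                }
            }
            // Scanner swallows I/O errors, so rethrow any that occurred.
            if (sc.ioException() != null) {
                throw sc.ioException();
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sentences", ".txt");
        try {
            // One continuous string, no line terminators.
            Files.write(tmp, "one two. three four. five six.".getBytes(StandardCharsets.UTF_8));
            System.out.println("lines: " + countLines(tmp));
            System.out.println("sentences: " + countSentences(tmp));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

With a 2 GB single-line file, the one element produced by Files.lines would be the entire content, which is the same memory problem as readLine().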