
Parsing a huge plain text file


I have a huge text file (207 MB, 4 million lines) and I need to read it sequentially line by line.
Every line has this format:
20227993821NAME AND SURNAME NINIC NN08
For regular files I have been using the Java library's FileReader and BufferedReader like this:

FileReader dataFile = new FileReader(directory);
BufferedReader data = new BufferedReader(dataFile);
String s;
while((s = data.readLine()) != null){
    //do stuff
}

This works without problems, but with huge files it takes too much time to process.
I wonder what the best practice is in such cases (another library, different methods, etc.); anything would be helpful.
The file is issued periodically by a government agency and it must be loaded into my software for data comparison.

Edit:

This code:

BufferedReader data = new BufferedReader(new FileReader(file));
String s;
int count = 0;
while ((s = data.readLine()) != null) {
   System.out.println (count + " - " + s);
   count++;
}
data.close();

executed in 19 minutes 30 seconds. I don't know why it took so long.
I have a 64-bit operating system and an i5 processor.


Solution

  • If I run

    File file = new File("/tmp/deleteme.txt");
    file.deleteOnExit();
    
    long start = System.nanoTime();
    PrintWriter pw = new PrintWriter(file);
    for (int i = 0; i < 4 * 1000 * 1000; i++)
        pw.println("01234567890123456789012345678901234567890123456789");
    pw.close();
    
    long mid = System.nanoTime();
    BufferedReader data = new BufferedReader(new FileReader(file));
    String s;
    while ((s = data.readLine()) != null) {
        //do stuff
    }
    data.close();
    long end = System.nanoTime();
    
    System.out.printf("Took %.3f seconds to write and %.3f seconds to read a %.2f MB file.%n",
            (mid - start) / 1e9, (end - mid) / 1e9, file.length() / 1e6);
    

    it prints

    Took 0.465 seconds to write and 0.522 seconds to read a 204.00 MB file.
    

    EDIT: If I print out each line, it slows down dramatically because writing to the screen takes a long time. I have found the MS-DOS window to be especially slow.

    Took 0.467 seconds to write and 10.254 seconds to read a 204.00 MB file.
    

    I don't believe it's the reading of the file that is taking too long; it is what you are doing with each line that takes the time.
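
    One way to check this on the real file is to keep the read loop but stop printing every line. The following is only a minimal sketch: the class name LineCounter, the method countLines and the reportEvery parameter are made up for illustration, and it assumes a progress message every million lines is enough feedback.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LineCounter {

    // Reads the file line by line and returns the number of lines read.
    // A progress message is printed only once every `reportEvery` lines,
    // so console output no longer dominates the run time.
    static long countLines(Path file, long reportEvery) throws IOException {
        long count = 0;
        // try-with-resources closes the reader even if an exception is thrown
        try (BufferedReader data = new BufferedReader(new FileReader(file.toFile()))) {
            String s;
            while ((s = data.readLine()) != null) {
                // do stuff with s here
                count++;
                if (count % reportEvery == 0) {
                    System.out.println(count + " lines read");
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]); // path to the data file, passed on the command line
        long start = System.nanoTime();
        long lines = countLines(file, 1_000_000);
        long end = System.nanoTime();
        System.out.printf("Read %d lines in %.3f seconds.%n", lines, (end - start) / 1e9);
    }
}
```

    If this finishes quickly on the 207 MB file, the time is going into the per-line work (printing, comparison, database access and so on) rather than into reading.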