java bufferedreader filewriter

Modify content of large file


I have exported the tables of my database to JSON files, and now I want to read these files and remove all the double quotes from them. It seems easy, but I have tried hundreds of solutions, and some of them ran into out-of-memory problems. The files are more than 1 GB in size. The code below behaves strangely, and I don't understand why it produces empty files:

  public void replaceDoubleQuotes(String fileName){
    log.debug(" start formatting " + fileName + " ...");
    File firstFile = new File ("C:/sqlite/db/tables/" + fileName);
    String oldContent = "";
    String newContent = "";
    BufferedReader reader = null;
    BufferedWriter writer = null;
    FileWriter writerFile = null;
    String stringQuotes = "\\\\\\\\\"";
    try {
        reader = new BufferedReader(new FileReader(firstFile));
        writerFile = new FileWriter("C:/sqlite/db/tables/" + fileName);
        writer = new BufferedWriter(writerFile);
        
    while   (( oldContent = reader.readLine()) != null ){
        newContent = oldContent.replaceAll(stringQuotes, "");
        writer.write(newContent);
        }
    
    writer.flush();
    writer.close();
    } catch (Exception e) {
        log.error(e);
    }
}

When I use FileWriter(path, true) instead, to write at the end of the file, the program keeps increasing the file size until the hard disk is full. Thanks for your help.

PS: I also tried using substring and appending the new content, then writing the result after the while loop, but that doesn't work either.


Solution

  • TL;DR

    Do not read and write the same file concurrently.

    The issue

    Your code starts reading, and then immediately truncates the file it is reading.

     reader = new BufferedReader(new FileReader(firstFile));
     writerFile = new FileWriter("C:/sqlite/db/tables/" + fileName);
     writer = new BufferedWriter(writerFile);
        
    

    The first line opens a read handle to the file. The second line opens a write handle to the same file. The documentation of the FileWriter constructors does not make this very obvious, but a constructor that has no append parameter behaves as if append were false: it truncates the file immediately if it already exists.

    At this point (the second line), you have just erased the file you were about to read, so you end up with an empty file.
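
    The truncation can be observed directly. The following is a minimal sketch, not taken from the original code: the class name TruncationDemo and the helper sizeAfterOpeningWriter are made up for illustration. It opens a FileWriter on an existing file, writes nothing through it, and reports the file size.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class; the names are illustrative only.
final class TruncationDemo {

    /** Returns the file size in bytes right after a FileWriter has been opened on it. */
    static long sizeAfterOpeningWriter(Path file, boolean append) throws IOException {
        // Opening the writer is enough; nothing is written through it.
        try (FileWriter writer = new FileWriter(file.toFile(), append)) {
            return Files.size(file);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("truncation-demo", ".json");
        try {
            Files.write(file, "{\"id\": 1}\n".getBytes(StandardCharsets.UTF_8));
            System.out.println("before:       " + Files.size(file));
            // append=true leaves the existing content alone.
            System.out.println("append=true:  " + sizeAfterOpeningWriter(file, true));
            // append=false (the default of the shorter constructors) empties the file.
            System.out.println("append=false: " + sizeAfterOpeningWriter(file, false));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

    With append=false the reported size is 0 even though no write call was made, which is what happens to the input file in the question.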

    What about using append=true?

    Well, then the file is not erased when the writer is created, which is "good". So your program starts reading the first line and writes the filtered version to the end of the same file.

    So each time a line is read, another is appended.

    No wonder your program never reaches the end of the file: each time it advances by one line, it creates another line to process. Generally speaking, you'll never reach the end of the file (if the file is a single line to begin with you might, but that's a corner case).

    The solution

    Write to a temporary file and, if (and only if) you succeed, swap the files, if you really need to.

    An advantage of this solution: if your process crashes for whatever reason, the original file is left untouched and you can retry later, which is usually a good thing. Your process is "repeatable".

    A disadvantage: at some point you'll need twice the disk space (you could compress the temporary file to reduce this factor, but still).
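
    The temporary-file approach could be sketched as follows. This is one possible implementation under stated assumptions, not the only way to do it: the class and method names (QuoteStripper, stripDoubleQuotes) are hypothetical, the input is assumed to be UTF-8 text, and the temporary file is created next to the source file so that the final move stays on the same file system.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper class; the names are illustrative only.
final class QuoteStripper {

    /** Rewrites {@code source} without double quotes, going through a temporary file. */
    static void stripDoubleQuotes(Path source) throws IOException {
        // Sibling temp file (assumption: the source directory is writable).
        Path temp = Files.createTempFile(source.toAbsolutePath().getParent(), "strip-", ".tmp");
        try {
            // Assumption: the file is UTF-8 encoded.
            try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8);
                 BufferedWriter writer = Files.newBufferedWriter(temp, StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Plain-text replace, so there is no regex escaping to get wrong.
                    writer.write(line.replace("\"", ""));
                    // readLine() strips the line terminator, so put one back.
                    writer.newLine();
                }
            }
            // Reached only if the whole file was processed without an exception.
            Files.move(temp, source, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // Does nothing after a successful move; cleans up after a failure.
            Files.deleteIfExists(temp);
        }
    }

    public static void main(String[] args) throws IOException {
        stripDoubleQuotes(Paths.get(args[0]));
    }
}
```

    Note that newLine() writes the platform line separator, so the output may differ from the input in its line endings, and a final line without a terminator gains one.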

    About out of memory issues

    When working with arbitrarily large files, the path you chose (using buffered readers and writers) is the right one, because you only hold about one line's worth of data in memory at a time.

    Therefore it generally avoids memory issues (unless, of course, the file has no line breaks, in which case readLine() has to load the whole content anyway and reading line by line makes no difference at all).
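
    If the file might have no line breaks, one option is to process fixed-size chunks of characters instead of lines. The sketch below is an assumption-laden illustration (the class ChunkedQuoteFilter, the method copyWithoutQuotes, and the 8192-character chunk size are all made up here). It works chunk by chunk only because the thing being removed is a single character, so a match can never straddle two chunks.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical helper class; the names are illustrative only.
final class ChunkedQuoteFilter {

    /** Copies {@code in} to {@code out}, dropping every double quote, one chunk at a time. */
    static void copyWithoutQuotes(Reader in, Writer out) throws IOException {
        char[] buffer = new char[8192]; // chunk size is an arbitrary choice
        int read;
        while ((read = in.read(buffer)) != -1) {
            int kept = 0;
            // Compact the chunk in place, skipping quote characters.
            for (int i = 0; i < read; i++) {
                if (buffer[i] != '"') {
                    buffer[kept++] = buffer[i];
                }
            }
            out.write(buffer, 0, kept);
        }
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        copyWithoutQuotes(new StringReader("{\"name\":\"value\"}"), out);
        System.out.println(out);
    }
}
```

    Memory use is bounded by the chunk size regardless of line length. The reader and the writer must still point at different files.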

    Other solutions that read the whole file at once, perform the search/replace in memory, and then write the contents back do not scale that well, so it's good that you avoided this kind of approach.

    Not related but important

    Check out the try-with-resources syntax to properly close your resources (reader and writer). Here you forgot to close the reader, and you are not closing the writer appropriately either (that is, in a finally clause): if an exception is thrown inside the try block, close() is never called.
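
    To illustrate what try-with-resources guarantees, here is a small sketch with a made-up AutoCloseable (the class Tracked stands in for a reader or writer; none of these names come from the original code). Both resources are closed, in reverse order of declaration, even though the body throws.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; the names are illustrative only.
final class CloseOrderDemo {

    /** Stand-in for a reader or writer; records when it is closed. */
    static final class Tracked implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Tracked(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close() {
            log.add("closed " + name);
        }
    }

    /** Opens two resources, fails while "processing", and returns what happened. */
    static List<String> run() {
        List<String> log = new ArrayList<>();
        try (Tracked reader = new Tracked("reader", log);
             Tracked writer = new Tracked("writer", log)) {
            throw new IllegalStateException("simulated failure");
        } catch (IllegalStateException e) {
            // Both resources are already closed by the time we get here.
            log.add("caught " + e.getMessage());
        }
        return log;
    }

    public static void main(String[] args) {
        // Prints: closed writer, closed reader, caught simulated failure (one per line).
        run().forEach(System.out::println);
    }
}
```

    The same shape applies to a BufferedReader and BufferedWriter: declare both in the try header and drop the explicit flush() and close() calls.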

    Another thing: I'm pretty sure no Java program written by a mere mortal will beat tools like sed or awk, which are available on most Unix platforms (and some more). You may want to check whether rolling your own in Java is worth it for what is a shell one-liner.