Search code examples
javafilebufferedreader

External Sorting from files in Java


I am wondering how do we write Java code from the following PseudoCode

 foreach file F in file directory D
        foreach int I in file F
               sort all I from each file

Basically this is part of the External Sorting algorithm, so those files contain lists of sorted integer, and I want to read the first one from each file and sort it and then output to another file, and then move to the next integer from each file again until all the integers are fully sorted.
The problem is that as far as I understand for each file we need a reader, so if we have N files then does that mean we need N file readers?

======update=======

I am wondering is it something that look like this? Correct me if I miss anything or any other better approach.

int numOfFiles = 10;
Scanner [] scanners = new Scanner[numOfFiles];
try{
    //reader all the files
    for(int i = 0 ; i < numOfFiles; i++){
        scanners[i] = new Scanner(new BufferedReader(
            new FileReader("file"+i+".txt");
    }
}
catch(FileNotFoundException fnfe){

}

Solution

  • The problem is that as far as I understand for each file we need a reader, so if we have N files then does that mean we need N file readers ?

    Yes, that's right - unless you want to either have to go back over the data, or the whole of each file into memory. Either of those would let you get away with only one file open at a time - but that may well not suit what you want to do.

    Operating systems usually only allow you to open a certain number of files at a time. If you're trying to do something like create a single sorted set of results from a very large number of files, you might want to consider operating on a few of them at a time, producing larger intermediate files. At its simplest, this would just sort two files at a time, e.g.

    input1 + input2 => tmp-a1
    input3 + input4 => tmp-a2
    input5 + input6 => tmp-a3
    input7 + input8 => tmp-a4
    
    tmp-a1 + tmp-a2 => tmp-b1
    tmp-a3 + tmp-a4 => tmp-b2
    
    tmp-b1 + tmp-b2 => result