Search code examples
javafilesortingexternal-sorting

How to merge files based on key value pairs and also sort it using Java?


I have written a program similar to external sort. I got a good idea from this blog. Here, they are trying to do external sort of only numbers. My requirement is little different. My input file can have over million records and it is difficult to sort them in-memory, so I have to make use of my disk. I divide my input into various slices, sort it and then store it in temporary files. And then merge sorted output into a single file. Below I am able to split it into temporary files and then merge only the keys.

I have an input file as below:

key1 abc
key2 world
key1 hello
key3 tom
key7 yankie
key3 apple
key5 action
key7 jack
key4 apple
key2 xon
key1 lemon

Let's say the size of the file on disk is 10 and maximum items memory buffer can hold is 4, so what I have done is taken 4 records at a time and stored it in HashMap, sorted my values along with the updated count. This input will be split into 3 sorted files as shown below. You can see, for each key I have a count and also the lexicographically highest value.

temp-file-0.txt

key1: 2, hello
key2: 1, world
key3: 1, tom

temp-file-1.txt

key5: 1, action
key3: 1, apple
key7: 2, yankie

temp-file-2.txt

key1: 1, lemon
key2: 1, xon
key4: 1, apple

then after merging all these 3 files, output should look like this:

key1: 3 lemon
key2: 2 xon
key3: 2 world
key5: 1 action
key7: 2 yankie

I am not sure about the logic of merging the entire line along with count and the lexicographically highest value for that key, my below code is able to give me all keys, something like this:

key1
key1
key2
key2
key3
key4
key5
key3
key7

In below code, I am opening each file and merging them, then writing back to disk into a new single file called external-sorted.txt

    static int N = 10; // size of the file in disk
     static int M = 4; // max items the memory buffer can hold
     int slices = (int) Math.ceil((double) N/M);
     String tfile = "temp-file-";

//Reading all the 3 temp files
     BufferedReader[] brs = new BufferedReader[slices];
     String[] topNums = new String[slices];
     for(i = 0; i<slices; i++){
      brs[i] = new BufferedReader(new FileReader(tfile + Integer.toString(i) + ".txt"));
      String t = brs[i].readLine();
      String[] kv = t.split(":");

      if(t!=null){
        topNums[i] = kv[0];
      }
    //topNums [key1, key5, key1]
     }

    FileWriter fw = new FileWriter("external-sorted.txt");
    PrintWriter pw = new PrintWriter(fw);

    for(i=0; i<N; i++){

    String min = topNums[0];
    System.out.println("min:"+min);
    int minFile = 0;

    for(j=0; j<slices; j++){

    if(min.compareTo(topNums[j])>0)
      {
      min = topNums[j];
      minFile = j;
      }
    }

     pw.println(min);
      String t = brs[minFile].readLine();
     String[] kv = new String[2];
      if (t != null)
         kv = t.split(":");
         topNums[minFile] = kv[0];

    }

       for (i = 0; i < slices; i++)
        brs[i].close();

       pw.close();
       fw.close();
      }

Any ideas are appreciated. Please ask, if you have any questions. TIA.


Solution

  • Well, something like this works, I'm sure there are better ways, but at the moment I'm not up to thinking really:

        // Declare Scanner Object to read our file
        Scanner in = new Scanner(new File(stringRepresentingLocationOfYourFileHere));
    
        // create Map that will contain keys in sorted order (TreeMap)
        // along with last value assigned to the key
        Map<String, String> mapa = new TreeMap<>();
    
        // another map to hold keys from first map and number of
        // occurrences of those keys (repetitions), this could have been
        // done using single Map as well, but whatever
        Map<String, Integer> mapaDva = new HashMap<>();
    
        // String array that will hold words of each line of our .txt file
        String[] line;
    
        // we loop until we reach end of our .txt file
        while(in.hasNextLine()){
    
            // check if map already contains given key, if it does
            // increment value by 1 otherwise initialize the value with 1
            if (mapa.put((line = in.nextLine().split(" "))[0], line[1]) != null)
                mapaDva.put(line[0], mapaDva.get(line[0])+1);
            else
                mapaDva.put(line[0], 1);
        }
    
        // loop through our maps and print out keys, number of 
        //repetitions, last assigned value
        for (Map.Entry<String, String> m : mapa.entrySet()){
            System.out.println(m.getKey() + " " + mapaDva.get(m.getKey()) + " " + m.getValue());
        }
    

    If there is something specific that is unclear about this code, please ask.

    Sample input file:

    key1 abcd
    key2 zzz
    key1 tommy
    key3 world
    

    Output when done:

    key1 2 tommy
    key2 1 zzz
    key3 1 world
    

    EDIT 2 (solution for when dealing with multiple files):

     // array of File objects that hold path to all your files to iterate through
        File[] files = {new File("file1.txt"),
                        new File("file2.txt"),
                        new File("file3.txt")};
    
        Scanner in;
        Map<String, String> mapa = new TreeMap<>();
        Map<String, Integer> mapaDva = new HashMap<>();
        String[] line;
    
        for (int i = 0; i < files.length; i++) {
            // assign new File to Scanner on each iteration (go through our File array)
            in = new Scanner(files[i]);
            while(in.hasNextLine()){
                if (mapa.put((line = in.nextLine().split(" "))[0], line[1]) != null)
                    mapaDva.put(line[0], mapaDva.get(line[0])+1);
                else
                    mapaDva.put(line[0], 1);
            }
        }
    
    
    
        for (Map.Entry<String, String> m : mapa.entrySet()){
            System.out.println(m.getKey() + " " + mapaDva.get(m.getKey()) + " " + m.getValue());
        }
    

    So we store all the File objects in our File array and we go through each and every one of them, combine all the content and print out final result:

    3 sample input files:

    file1.txt

    key1 abcd
    key2 zzz
    key1 tommy
    key3 world
    

    file2.txt

    key1 abc
    key3 xxx
    key1 tommy
    key6 denver
    

    file3.txt

    key5 lol
    key8 head
    key6 tommy
    key6 denver
    

    OUTPUT:

    key1 4 tommy
    key2 1 zzz
    key3 2 xxx
    key5 1 lol
    key6 3 denver
    key8 1 head