
Merge files into a new big file until the number of user IDs reaches 10 million


I have around 100 files in a folder. Each file contains data like the following, where each line represents a user ID.

960904056
6624084
1096552020
750160020
1776024
211592064
1044872088
166720020
1098616092
551384052
113184096
136704072

I am trying to keep merging the files from that folder into a new big file until the total number of user IDs in that new file reaches 10 million.

I am able to read all the files from that folder, and I keep adding the user IDs from those files to a LinkedHashSet. My idea was then to check whether the size of the set has reached 10 million and, if so, write all those user IDs to a new text file. Is that a feasible solution?

The 10 million number should be configurable: if I later need to change 10 million to 50 million, I should be able to do that.
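
For that configurable limit, something like this is what I had in mind (just a sketch; the maxUserIds property name is made up):

// Sketch: take the threshold from the first command-line argument,
// falling back to a -DmaxUserIds system property, then to 10 million
final int limit = (args.length > 0)
        ? Integer.parseInt(args[0])
        : Integer.getInteger("maxUserIds", 10_000_000);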

Below is the code I have so far

public static void main(String args[]) {

    File folder = new File("C:\\userids-20130501");
    File[] listOfFiles = folder.listFiles();

    // the threshold; 10 million for now, but should be configurable
    final int limit = 10_000_000;

    Set<String> userIdSet = new LinkedHashSet<String>();
    for (int i = 0; i < listOfFiles.length; i++) {
        File file = listOfFiles[i];
        if (file.isFile() && file.getName().endsWith(".txt")) {
            try {
                List<String> content = FileUtils.readLines(file, Charset.forName("UTF-8"));
                userIdSet.addAll(content);
                if (userIdSet.size() >= limit) {
                    break;
                }
                System.out.println(userIdSet); // debug output
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

Any help would be appreciated. Is there a better way to do this?


Solution

  • Continuing from where we left off. ;)

    You can use FileUtils to write the file as well, with its writeLines() method.

    Try this -

    public static void main(String args[]) {

        File folder = new File("C:\\userids-20130501");

        final int limit = 10_000_000; // configurable threshold

        Set<String> userIdSet = new LinkedHashSet<String>();
        int count = 1;
        for (File file : folder.listFiles()) {
            if (file.isFile() && file.getName().endsWith(".txt")) {
                try {
                    List<String> content = FileUtils.readLines(file, Charset.forName("UTF-8"));
                    userIdSet.addAll(content);
                    if (userIdSet.size() >= limit) {
                        File bigFile = new File("<path>" + count + ".txt");
                        FileUtils.writeLines(bigFile, userIdSet);
                        count++;
                        userIdSet = new LinkedHashSet<String>();
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        // flush whatever is left once all the input files are read
        if (!userIdSet.isEmpty()) {
            try {
                FileUtils.writeLines(new File("<path>" + count + ".txt"), userIdSet);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    

    If the purpose of saving the data in the LinkedHashSet is just to write it out again to another file, then I have another solution. Note that the set also removes duplicate IDs across files; the append-only version below does not.

    EDIT to avoid an OutOfMemoryError

    public static void main(String args[]) {

        File folder = new File("C:\\userids-20130501");

        final int limit = 10_000_000; // configurable threshold

        int fileNameCount = 1;
        int contentCounter = 0;
        File bigFile = new File("<path>" + fileNameCount + ".txt");
        for (File file : folder.listFiles()) {
            if (file.isFile() && file.getName().endsWith(".txt")) {
                try {
                    List<String> content = FileUtils.readLines(file, Charset.forName("UTF-8"));
                    contentCounter += content.size();
                    if (contentCounter < limit) {
                        // still under the limit: append to the current big file
                        FileUtils.writeLines(bigFile, content, true);
                    } else {
                        // limit reached: start a new big file and count its content
                        fileNameCount++;
                        bigFile = new File("<path>" + fileNameCount + ".txt");
                        FileUtils.writeLines(bigFile, content);
                        contentCounter = content.size();
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
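
    The version above still loads each input file fully into memory with readLines(). If individual files can be large, a line-by-line streaming variant keeps only one line in memory at a time. A rough sketch with plain java.io (the user IDs are ASCII digits, so the default charset is safe here; "<path>" is still your output location):

    public static void main(String args[]) throws IOException {

        File folder = new File("C:\\userids-20130501");
        final int limit = 10_000_000; // configurable threshold

        int fileNameCount = 1;
        int written = 0;
        BufferedWriter writer = new BufferedWriter(
                new FileWriter("<path>" + fileNameCount + ".txt"));
        try {
            for (File file : folder.listFiles()) {
                if (file.isFile() && file.getName().endsWith(".txt")) {
                    BufferedReader reader = new BufferedReader(new FileReader(file));
                    try {
                        String line;
                        while ((line = reader.readLine()) != null) {
                            writer.write(line);
                            writer.newLine();
                            if (++written >= limit) {
                                // limit reached: roll over to the next big file
                                writer.close();
                                fileNameCount++;
                                writer = new BufferedWriter(
                                        new FileWriter("<path>" + fileNameCount + ".txt"));
                                written = 0;
                            }
                        }
                    } finally {
                        reader.close();
                    }
                }
            }
        } finally {
            writer.close();
        }
    }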