string-comparison opencsv file-processing

How to process a big file by comparing each line with all the remaining lines in the same file?


I have a CSV file with 500,000 records in it. The fields in the CSV file are as follows:

No, Name, Address

Now I want to compare the name and address of each record with the name and address of all the remaining records.
I was doing it in the following way:

    List<String> lines = new ArrayList<>();
    String line;
    BufferedReader firstbufferedReader = new BufferedReader(new FileReader(new File(pathname)));
    while ((line = firstbufferedReader.readLine()) != null) {
        lines.add(line);
    }
    firstbufferedReader.close();
    for (int i = 0; i < lines.size(); i++) {
        csvReader = new CSVReader(new StringReader(lines.get(i)));
        csvReader = null;
        for (int j = i + 1; j < lines.size(); j++) {
            csvReader = new CSVReader(new StringReader(lines.get(j)));
            csvReader = null;
            application.linesToCompare(lines.get(i), lines.get(j));
        }
    }

The linesToCompare function extracts the name and address from its two parameters and compares them. If two records match by 80% or more (based on name and address), I mark them as duplicates.
However, this approach takes too long to process the CSV file.
I want a faster approach, maybe some kind of map-reduce or anything else.
Thanks in advance.


Solution

  • It is taking a long time because it looks like you are doing a huge amount of redundant work inside the loops.

    You read the file into the lines list once, which is fine, but then for every entry you wrap that line in a new CSVReader, and inside the inner loop you create yet another one for every pair. Both readers are set to null straight away and never used. Instead of doing this, read the file once into your lines list and then use that list to compare the entries against each other.

    Something like this might work for you:

    List<String> lines = new ArrayList<>();
    // Read the whole file into memory once; try-with-resources closes the reader.
    try (BufferedReader firstbufferedReader = new BufferedReader(new FileReader(new File(pathname)))) {
        String line;
        while ((line = firstbufferedReader.readLine()) != null) {
            lines.add(line);
        }
    }
    // Compare each line with every line that comes after it.
    for (int i = 0; i < lines.size(); i++) {
        for (int j = i + 1; j < lines.size(); j++) {
            application.linesToCompare(lines.get(i), lines.get(j));
        }
    }
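
Since linesToCompare extracts the name and address from both lines on every call, one further saving may be to parse each line only once, before the loops start. The following is a minimal sketch under stated assumptions, not the asker's actual code: the class DuplicateFinder and the methods toKeys, levenshtein, similarity and findDuplicates are hypothetical names, and the Levenshtein-based ratio is only a stand-in for whatever the real 80% rule is. The naive split(",", 3) assumes the No and Name fields contain no quoted commas; a real CSV parser such as opencsv would handle those.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class DuplicateFinder {

    // Parse every line once into a normalized "name|address" key.
    // Assumption: fields are No,Name,Address with no quoted commas in No or Name.
    static List<String> toKeys(List<String> lines) {
        List<String> keys = new ArrayList<>();
        for (String line : lines) {
            String[] f = line.split(",", 3);
            String name = f.length > 1 ? f[1].trim() : "";
            String address = f.length > 2 ? f[2].trim() : "";
            keys.add((name + "|" + address).toLowerCase(Locale.ROOT));
        }
        return keys;
    }

    // Classic two-row Levenshtein edit distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Hypothetical similarity: 1.0 for identical keys, lower as edits increase.
    static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / max;
    }

    // Compare each pre-parsed key with every later key; return index pairs at or above the threshold.
    static List<int[]> findDuplicates(List<String> keys, double threshold) {
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < keys.size(); i++) {
            for (int j = i + 1; j < keys.size(); j++) {
                if (similarity(keys.get(i), keys.get(j)) >= threshold) {
                    pairs.add(new int[] {i, j});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "1,John Smith,12 Main St",
                "2,Jon Smith,12 Main St",
                "3,Alice Brown,9 Oak Ave");
        for (int[] p : findDuplicates(toKeys(lines), 0.8)) {
            System.out.println(lines.get(p[0]) + "  ~  " + lines.get(p[1]));
        }
    }
}
```

Note that this only reduces the cost of each comparison. With 500,000 records the nested loops still visit 500,000 × 499,999 / 2, roughly 1.25 × 10^11, pairs, so the run time will probably remain substantial.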