python-2.7 csv python-3.x cmp

Delete duplicate records in a CSV file in Python 2.7


My INPUT file:

1,boss,30
2,go,35
2,nan,45
3,fog,33
4,kd,55
4,gh,56

Output file should be:

1,boss,30
3,fog,33

In other words, the output file should contain no duplicates: any record whose value in column 1 appears more than once should be deleted entirely, not just its repeats.

Code I tried:

source_rd = csv.writer(open("Non_duplicate_source.csv", "wb"),delimiter=d)
gok = set()
for rowdups in sort_src:
    if rowdups[0] not in gok:
        source_rd.writerow(rowdups)
        gok.add( rowdups[0])

Output I got:

1,boss,30
2,go,35
3,fog,33
4,kd,55

What am I doing wrong?


Solution

  • You can just loop the file twice.

    On the first pass, count how many times each first-column value occurs. On the second pass, keep only the rows whose value occurs exactly once.

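    import csv

    fn = "input.csv"  # assumed name; replace with the path to your input file

    # First pass: count how often each first-column value occurs.
    gok = {}
    with open(fn) as fin:
        reader = csv.reader(fin)
        for e in reader:
            gok[e[0]] = gok.get(e[0], 0) + 1

    # Second pass: print only the rows whose first-column value is unique.
    with open(fn) as fin:
        reader = csv.reader(fin)
        for e in reader:
            if gok[e[0]] == 1:
                print(e)  # print(e) gives the same output on Python 2.7 and 3.x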
    

    Prints:

    ['1', 'boss', '30']
    ['3', 'fog', '33']
    

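    Your method does not work because the set only stops the second and later occurrences of a key from being written. By the time the second instance of a duplicate is seen, the first has already been written, so one row per key always survives.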
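
If the result needs to go to a file rather than to the screen, the same count-then-filter idea can be sketched as below. This is only a minimal sketch: the helper names `drop_repeated_keys` and `write_unique_rows` are hypothetical, `collections.Counter` is swapped in for the hand-rolled dict, and the file modes shown are the Python 3 ones (on Python 2.7 you would open the files with "rb" and "wb" instead).

```python
import csv
from collections import Counter


def drop_repeated_keys(rows, key_index=0):
    """Return only the rows whose key column value occurs exactly once."""
    # Materialise the rows so they can be scanned twice; skip blank lines,
    # which csv.reader yields as empty lists.
    rows = [row for row in rows if row]
    counts = Counter(row[key_index] for row in rows)
    return [row for row in rows if counts[row[key_index]] == 1]


def write_unique_rows(src_path, dst_path, delimiter=","):
    """Copy src_path to dst_path, dropping every row with a repeated key."""
    # Python 3 file modes; on Python 2.7 use open(src_path, "rb") and
    # open(dst_path, "wb") as in the question.
    with open(src_path, newline="") as fin:
        kept = drop_repeated_keys(csv.reader(fin, delimiter=delimiter))
    with open(dst_path, "w", newline="") as fout:
        csv.writer(fout, delimiter=delimiter).writerows(kept)
```

Calling `write_unique_rows("source.csv", "Non_duplicate_source.csv")` (the first file name is an assumption) would then write only the rows for keys 1 and 3 from the sample input. Holding all rows in memory is fine for small files; for very large ones, the two-pass version above that reads the file twice avoids that.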