How to remove duplicated rows in a CSV file based on a column


I basically want to remove all rows with duplicated cells in the second column in a CSV file:

Skufnoo,222228888444,-6026769894509215039,Вупсень пупсень ❤️‍🩹💗,AA2888 ចាក់បាល់និងកាសុីណូអនឡាញ (070645555),1746008070,False,False,4,True,False,0
mAtkmb,5213786988,4161254730445748607,Даниэль Блинов,AA2888 ចាក់បាល់និងកាសុីណូអនឡាញ (070645555),1746008070,False,False,False,False,False,0
Ethan58,222228888444,7737583697013043644,Ethan,AA2888 ចាក់បាល់និងកាសុីណូអនឡាញ (070645555),1746008070,False,False,4,True,False,0
sheluvjoseph,1421438213,8544915453690665435,អន សំអុល,AA2888 ចាក់បាល់និងកាសុីណូអនឡាញ (070645555),1746008070,False,False,5,True,False,0

and write them to a new CSV file like this:

Skufnoo,222228888444,-6026769894509215039,Вупсень пупсень ❤️‍🩹💗,AA2888 ចាក់បាល់និងកាសុីណូអនឡាញ (070645555),1746008070,False,False,4,True,False,0
mAtkmb,5213786988,4161254730445748607,Даниэль Блинов,AA2888 ចាក់បាល់និងកាសុីណូអនឡាញ (070645555),1746008070,False,False,False,False,False,0
sheluvjoseph,1421438213,8544915453690665435,អន សំអុល,AA2888 ចាក់បាល់និងកាសុីណូអនឡាញ (070645555),1746008070,False,False,5,True,False,0

I have tried the following code, but it doesn't work:

import csv

with (
    open('members.csv', 'r', encoding="utf8") as in_file,
    open('members2.csv', 'w', encoding="utf8") as out_file,
):
    writer=csv.writer(out_file)
    tracks = set()
    for row in in_file:
        key = row[1]
        if key not in tracks:
            writer.writerow(row)
            tracks.add(key)

Any help is much appreciated.


Solution

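  • You forgot to parse the input file with csv.reader. Iterating over in_file directly yields each line as a plain string, so row[1] is the second character of the line, not the second column. Wrap the file in a reader first:

    in_data = csv.reader(in_file, delimiter=',')

    The rest of your logic is fine once the loop iterates over in_data instead of in_file. The csv module documentation also recommends opening both files with newline='' so that the module handles line endings itself.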

    Complete code:

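    import csv

    # newline="" lets the csv module handle line endings itself (avoids blank rows on Windows)
    with open('members.csv', 'r', encoding="utf8", newline="") as in_file, \
         open('members2.csv', 'w', encoding="utf8", newline="") as out_file:
        in_data = csv.reader(in_file, delimiter=',')  # yields each row as a list of fields
        writer = csv.writer(out_file)

        tracks = set()  # second-column values that have already been written

        for row in in_data:
            key = row[1]  # value in the second column
            if key not in tracks:
                # first occurrence of this value: keep the row
                writer.writerow(row)
                tracks.add(key)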
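
If you want to try the same logic without touching any files, here is a minimal sketch that wraps it in a generator and runs it on an in-memory sample. The function name unique_rows, the key_index parameter and the sample values are made up for illustration and are not part of the question. Skipping rows that are too short is also an assumption on my part; the code above would raise IndexError on them instead.

```python
import csv
import io


def unique_rows(rows, key_index=1):
    """Yield each row whose value at key_index has not been seen before."""
    seen = set()
    for row in rows:
        if len(row) <= key_index:
            # Assumption: silently skip blank or short rows instead of failing.
            continue
        key = row[key_index]
        if key not in seen:
            seen.add(key)
            yield row


# Made-up sample standing in for members.csv; rows 1 and 3 share the key "111".
sample = "alice,111,x\nbob,222,y\ncarol,111,z\n"

out = io.StringIO()
reader = csv.reader(io.StringIO(sample))
csv.writer(out).writerows(unique_rows(reader))
result = out.getvalue()
print(result)
```

Only the first row for each second-column value is kept, which is the same behaviour as the loop above. To use it on real files, pass csv.reader(in_file) as rows and feed the result to writer.writerows.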