Search code examples
pythoncsvpython-3.xdeduplication

Python - Selecting All Row Values That Meet A particular Criteria Once


I have a form set up with the following fields: Date Time, ID, and Address. This form auto assigns each entry a unique id string (U_ID) and then this data is later output to a csv with headers and rows something like this:

Date Time       ID    U_ID     Address
9/12/13 12:07   13    adfasd   1345 Wilson Way
9/12/13 13:45   8     jklj     1456 Jackson Hollow
9/13/13 14:55   13    klidh    1345 Wilson Way
9/13/13 15:00   8     ikodl    1456 Jackson Hollow

I am looking for a way to remove duplicate submissions by those with the same ID via Python Script, while maintaining the rest of the data in the row. Ideally I want to keep only the first entry associated with the ID in the csv file.

The output should look something like this:

Date Time       ID    U_ID     Address
9/12/13 12:07   13    adfasd   1345 Wilson Way
9/12/13 13:45   8     jklj     1456 Jackson Hollow

So far I'm stuck at:

import csv

with open('/Users/user/Desktop/test.csv', 'rb') as f:
r = csv.reader(f)
headers = r.next()
rows = [(Date Time, ID, U_ID, Address) for Date Time, ID, U_ID, Address in r]
clean = [row for row in rows if row[1] != '#N/A']
clean2 = list(set(row[1]))

This gives me a list with only the unique values for ID, but I'm not sure how to recover all of the other data associated with the rows for those values.

As stated if I can get the earliest submission as well, that would be wonderful, but honestly any unique submission by ID should do.

Thanks for reading!


Solution

  • Take a look at pandas, this is how you would do it:

    import pandas as pd
    
    pd.read_table('test.csv')\
      .drop_duplicates(subset=['ID'])\
      .to_csv('output.csv', index=None, sep='\t')
    

    output.csv:

    Date    Time    ID  U_ID    Address
    9/12/13 12:07   13  adfasd  1345 Wilson Way
    9/12/13 13:45   8   jklj    1456 Jackson Hollow