I have a form set up with the following fields: Date Time, ID, and Address. This form auto assigns each entry a unique id string (U_ID) and then this data is later output to a csv with headers and rows something like this:
Date Time ID U_ID Address
9/12/13 12:07 13 adfasd 1345 Wilson Way
9/12/13 13:45 8 jklj 1456 Jackson Hollow
9/13/13 14:55 13 klidh 1345 Wilson Way
9/13/13 15:00 8 ikodl 1456 Jackson Hollow
I am looking for a way to remove duplicate submissions by those with the same ID via Python Script, while maintaining the rest of the data in the row. Ideally I want to keep only the first entry associated with the ID in the csv file.
The output should look something like this:
Date Time ID U_ID Address
9/12/13 12:07 13 adfasd 1345 Wilson Way
9/12/13 13:45 8 jklj 1456 Jackson Hollow
So far I'm stuck at:
import csv
with open('/Users/user/Desktop/test.csv', 'rb') as f:
r = csv.reader(f)
headers = r.next()
rows = [(Date Time, ID, U_ID, Address) for Date Time, ID, U_ID, Address in r]
clean = [row for row in rows if row[1] != '#N/A']
clean2 = list(set(row[1]))
This gives me a list with only the unique values for ID, but I'm not sure how to recover all of the other data associated with the rows for those values.
As stated if I can get the earliest submission as well, that would be wonderful, but honestly any unique submission by ID should do.
Thanks for reading!
Take a look at pandas, this is how you would do it:
import pandas as pd
pd.read_table('test.csv')\
.drop_duplicates(subset=['ID'])\
.to_csv('output.csv', index=None, sep='\t')
output.csv:
Date Time ID U_ID Address
9/12/13 12:07 13 adfasd 1345 Wilson Way
9/12/13 13:45 8 jklj 1456 Jackson Hollow