I am trying to clean up my loaded CSV file using multiple column values, so I can filter out duplicate records and drop them, but I get an error related to a date column:
My sample data is:
ACTIVITY_DATE | OWNNER_ID | OWNER_NAME |
---|---|---|
1/1/2020 | 23344 | JAMES NELSON |
2/1/2020 | 33445 | NIGEL THOMAS |
1/1/2020 | 23344 | JAMES NELSON |
2/1/2020 | 33445 | NIGEL THOMAS |
My code is:
inspections = inspections[inspections.duplicated(subset=['ACTIVITY_DATE','OWNER_ID'], keep=False)]
My error is:
KeyError: 'ACTIVITY_DATE'
Intended output:
ACTIVITY_DATE | OWNNER_ID | OWNER_NAME |
---|---|---|
1/1/2020 | 23344 | JAMES NELSON |
2/1/2020 | 33445 | NIGEL THOMAS |
In your particular example, plain .drop_duplicates() should work just fine:
In [71]: df
Out[71]:
ACTIVITY_DATE OWNNER_ID OWNER_NAME
0 1/1/2020 23344 JAMES NELSON
1 2/1/2020 33445 NIGEL THOMAS
2 1/1/2020 23344 JAMES NELSON
3 2/1/2020 33445 NIGEL THOMAS
In [72]: df.drop_duplicates()
Out[72]:
ACTIVITY_DATE OWNNER_ID OWNER_NAME
0 1/1/2020 23344 JAMES NELSON
1 2/1/2020 33445 NIGEL THOMAS
Alternatively, if you want to provide only a subset of columns to be compared, you can do it this way:
In [77]: df.drop_duplicates(subset=['ACTIVITY_DATE','OWNNER_ID'])
Out[77]:
ACTIVITY_DATE OWNNER_ID OWNER_NAME
0 1/1/2020 23344 JAMES NELSON
1 2/1/2020 33445 NIGEL THOMAS