python pandas dataframe date data-cleaning

Data deduplication using python

I am trying to clean up my loaded CSV file using multiple columns values, so I can filter duplicate records out of it and hopefully drop them out, But I get and error related to date:

My sample data is:

ACTIVITY_DATE	OWNNER_ID	OWNER_NAME
1/1/2020	23344	JAMES NELSON
2/1/2020	33445	NIGEL THOMAS
1/1/2020	23344	JAMES NELSON
2/1/2020	33445	NIGEL THOMAS

My code is:

inspections = inspections[inspections.duplicated(subset=['ACTIVITY_DATE','OWNER_ID'], keep=False)]

My error is:

KeyError: 'ACTIVITY_DATE'

÷ntended output

ACTIVITY_DATE	OWNNER_ID	OWNER_NAME
1/1/2020	23344	JAMES NELSON
2/1/2020	33445	NIGEL THOMAS

Solution

In your particular example, the .drop_duplicates() should be just fine?

In [71]: df
Out[71]:
  ACTIVITY_DATE  OWNNER_ID    OWNER_NAME
0      1/1/2020      23344  JAMES NELSON
1      2/1/2020      33445  NIGEL THOMAS
2      1/1/2020      23344  JAMES NELSON
3      2/1/2020      33445  NIGEL THOMAS

In [72]: df.drop_duplicates()
Out[72]:
  ACTIVITY_DATE  OWNNER_ID    OWNER_NAME
0      1/1/2020      23344  JAMES NELSON
1      2/1/2020      33445  NIGEL THOMAS

Alternatively, if you want to provide only a subset of columns to be compared, you can do it this way:

In [77]: df.drop_duplicates(subset=['ACTIVITY_DATE','OWNNER_ID'])
Out[77]:
  ACTIVITY_DATE  OWNNER_ID    OWNER_NAME
0      1/1/2020      23344  JAMES NELSON
1      2/1/2020      33445  NIGEL THOMAS