Search code examples
pythonpandasdataframedatedata-cleaning

Data deduplication using python


I am trying to clean up my loaded CSV file using multiple columns values, so I can filter duplicate records out of it and hopefully drop them out, But I get and error related to date:

My sample data is:

ACTIVITY_DATE OWNNER_ID OWNER_NAME
1/1/2020 23344 JAMES NELSON
2/1/2020 33445 NIGEL THOMAS
1/1/2020 23344 JAMES NELSON
2/1/2020 33445 NIGEL THOMAS

My code is:

inspections = inspections[inspections.duplicated(subset=['ACTIVITY_DATE','OWNER_ID'], keep=False)]

My error is:

KeyError: 'ACTIVITY_DATE'

÷ntended output

ACTIVITY_DATE OWNNER_ID OWNER_NAME
1/1/2020 23344 JAMES NELSON
2/1/2020 33445 NIGEL THOMAS

Solution

  • In your particular example, the .drop_duplicates() should be just fine?

    In [71]: df
    Out[71]:
      ACTIVITY_DATE  OWNNER_ID    OWNER_NAME
    0      1/1/2020      23344  JAMES NELSON
    1      2/1/2020      33445  NIGEL THOMAS
    2      1/1/2020      23344  JAMES NELSON
    3      2/1/2020      33445  NIGEL THOMAS
    
    In [72]: df.drop_duplicates()
    Out[72]:
      ACTIVITY_DATE  OWNNER_ID    OWNER_NAME
    0      1/1/2020      23344  JAMES NELSON
    1      2/1/2020      33445  NIGEL THOMAS
    

    Alternatively, if you want to provide only a subset of columns to be compared, you can do it this way:

    In [77]: df.drop_duplicates(subset=['ACTIVITY_DATE','OWNNER_ID'])
    Out[77]:
      ACTIVITY_DATE  OWNNER_ID    OWNER_NAME
    0      1/1/2020      23344  JAMES NELSON
    1      2/1/2020      33445  NIGEL THOMAS