Tags: python, pandas, dataframe, group-by, crosstab

Group by unique Name and Status with the last Date


I would like to analyze statistics per car, distinguishing cars that have been repaired from cars that are still new. A data sample:

Name   IsItNew    ControlDate
Car1    True      31/01/2018
Car2    True      28/02/2018
Car1    False     15/03/2018
Car2    True      16/04/2018
Car3    True      30/04/2018
Car2    False     25/05/2018
Car1    False     30/05/2018    

So I need to group by Name, and if a group contains a False in the IsItNew column, the result for that Name should be False together with the first date on which False occurred.

I tried groupby with nunique():

df = df.groupby(['Name','IsItNew', 'ControlDate' ])['Name'].nunique()

But it returns the count of unique items in each group.

How can I get only the unique grouped rows, without any count?

Actual result is:

Name   IsItNew       ControlDate
Car1    True         31/01/2018     1
        False        15/03/2018     1
                     30/05/2018     1
Car2    True         28/02/2018     1
                     16/04/2018     1
        False        25/05/2018     1 
Car3    True         30/04/2018     1


Expected Result is:

Name   IsItNew     ControlDate
Car1    False      15/03/2018
Car2    False      25/05/2018
Car3    True       30/04/2018

I'd appreciate any ideas. Thanks!


Solution

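  • First convert the column to datetimes with to_datetime, then sort by the 3 columns with DataFrame.sort_values, and finally keep the first row per Name with DataFrame.drop_duplicates. Because False sorts before True, the first row of each Name is its earliest False row if one exists, otherwise its earliest True row:

    # dates are day/month/year, so parse them day-first
    df['ControlDate'] = pd.to_datetime(df['ControlDate'], dayfirst=True)
    
    df = df.sort_values(['Name','IsItNew', 'ControlDate']).drop_duplicates('Name')
    
    print(df)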
       Name  IsItNew ControlDate
    2  Car1    False  2018-03-15
    5  Car2    False  2018-05-25
    4  Car3     True  2018-04-30
    

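    EDIT: If a Name that has only True rows should get its last date instead of its first, replace the date for those rows. Sample data with two extra Car3 rows:

    print(df)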
       Name  IsItNew ControlDate
    0  Car1     True  31/01/2018
    1  Car2     True  28/02/2018
    2  Car1    False  15/03/2018
    3  Car2     True  16/04/2018
    4  Car3     True  30/04/2018
    5  Car2    False  25/05/2018
    6  Car1    False  30/05/2018
    7  Car3     True  20/10/2019
    8  Car3     True  30/04/2017
    
    # convert to datetimes (dates are day/month/year)
    df['ControlDate'] = pd.to_datetime(df['ControlDate'], dayfirst=True)
    # sort by 3 columns; False sorts before True
    df1 = df.sort_values(['Name','IsItNew', 'ControlDate'])
    
    # Series mapping each Name to the date of its last sorted row
    s = df1.drop_duplicates('Name', keep='last').set_index('Name')['ControlDate']
    
    # first row per Name: earliest False if present, otherwise earliest True
    df2 = df1.drop_duplicates('Name').copy()
    # for rows that are still True, replace the date with the last timestamp
    df2.loc[df2['IsItNew'], 'ControlDate'] = df2.loc[df2['IsItNew'], 'Name'].map(s)
    print(df2)
       Name  IsItNew ControlDate
    2  Car1    False  2018-03-15
    5  Car2    False  2018-05-25
    8  Car3     True  2019-10-20
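
For comparison, the same rule from the EDIT can be sketched with groupby aggregations instead of sorting and drop_duplicates. This is not the answer's method, only a minimal alternative under the same assumptions: dates are day/month/year, a Name with any False row takes its earliest False date, and a Name with only True rows takes its latest date. The function name summarize_cars is hypothetical.

```python
import pandas as pd

# Sample data from the EDIT above (original rows plus two extra Car3 rows).
df = pd.DataFrame({
    "Name": ["Car1", "Car2", "Car1", "Car2", "Car3",
             "Car2", "Car1", "Car3", "Car3"],
    "IsItNew": [True, True, False, True, True,
                False, False, True, True],
    "ControlDate": ["31/01/2018", "28/02/2018", "15/03/2018",
                    "16/04/2018", "30/04/2018", "25/05/2018",
                    "30/05/2018", "20/10/2019", "30/04/2017"],
})


def summarize_cars(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per Name, following the EDIT's rule."""
    frame = frame.copy()
    # Explicit format, since the sample dates are day/month/year.
    frame["ControlDate"] = pd.to_datetime(frame["ControlDate"],
                                          format="%d/%m/%Y")
    grouped = frame.groupby("Name")

    # True only if every row of the Name is True (the car was never repaired).
    still_new = grouped["IsItNew"].all()
    # Earliest date among the False rows; Names without a False row are absent.
    first_repair = (frame[~frame["IsItNew"]]
                    .groupby("Name")["ControlDate"].min())
    # Latest date per Name, used as the fallback for all-True Names.
    last_date = grouped["ControlDate"].max()

    # Align first_repair to all Names, then fill the gaps with the last date.
    control = first_repair.reindex(last_date.index).fillna(last_date)
    return pd.DataFrame({"IsItNew": still_new,
                         "ControlDate": control}).reset_index()


result = summarize_cars(df)
print(result)
```

To get the first variant instead (earliest date for all-True Names as well), the fallback could use min() rather than max(). The sort-based version keeps the original row index, which this sketch does not.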