Pandas groupby and compare to previous group


I want to compare values (in this case, names) between the dates in a DataFrame. Assume the following test frame:

import pandas as pd

# Create a sample DataFrame with 'Date' and 'Name' columns
data = {
    'Date': ['2023-10-03', '2023-10-03', '2023-10-02', '2023-10-02', '2023-10-01', '2023-10-01'],
    'Name': ['A', 'B', 'A', 'B', 'A', 'C']
}

df = pd.DataFrame(data)

I want to create a new column, e.g. "Check", that shows 0 for every date whose set of names is unchanged from the previous date, and 1 where it changed. The desired outcome would be:

   Date          Name   Check
0  2023-10-03    A      0      # A/B, as previously
1  2023-10-03    B      0
2  2023-10-02    A      1      # A/B, changed
3  2023-10-02    B      1
4  2023-10-01    A      1      # A/C
5  2023-10-01    C      1

For '2023-10-03' the value is 0 because the names are exactly the same as for '2023-10-02', namely 'A' and 'B'. For the other two dates the value is 1 because the name composition either changed or the date is the first observation.

Thanks!


Solution

  • If I understand correctly, you can aggregate the names per date as a set (with groupby.agg) and check whether the composition changed with shift:

    # ensure datetime
    df['Date'] = pd.to_datetime(df['Date'])
    
    # get composition per Date
    s = df.groupby('Date')['Name'].agg(set)
    
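    # flag dates whose composition differs from the previous calendar day
    # (freq='D' shifts the index by one day, so a missing day counts as a change)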
    df['Check'] = df['Date'].map(s.ne(s.shift(1, freq='D')).astype(int))
    

    Output:

            Date Name  Check
    0 2023-10-03    A      0
    1 2023-10-03    B      0
    2 2023-10-02    A      1
    3 2023-10-02    B      1
    4 2023-10-01    A      1
    5 2023-10-01    C      1
    

    Intermediate s:

    Date
    2023-10-01    {C, A}
    2023-10-02    {B, A}
    2023-10-03    {B, A}
    Name: Name, dtype: object
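
To illustrate why the `freq='D'` shift compares against the previous calendar day rather than the previous row, here is a minimal sketch on a hypothetical series with a gap (2023-10-03 is missing; the data and variable names are made up for illustration):

```python
import pandas as pd

# Hypothetical per-date compositions with a gap: 2023-10-03 is missing.
s = pd.Series(
    [{"A", "C"}, {"A", "B"}, {"A", "B"}],
    index=pd.to_datetime(["2023-10-01", "2023-10-02", "2023-10-04"]),
)

# Shifting with freq='D' moves the index labels forward by one day and
# leaves the values in place, so each date is paired with the value
# recorded exactly one calendar day earlier (if there is one).
prev_day = s.shift(1, freq="D")

# Series.ne aligns both operands on the union of their indexes;
# dates without a counterpart compare as "different".
changed_by_day = s.ne(prev_day)

# A positional shift pairs each date with the previous available row.
changed_by_row = s.ne(s.shift())

print(changed_by_day)
print(changed_by_row)
```

With this data, 2023-10-04 is flagged as changed by the calendar-day comparison (there is no 2023-10-03 to compare with) but as unchanged by the positional one. Note also that the calendar-day result contains extra dates from the shifted index, which `map` simply ignores.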
    

    If you want to compare with the next group in row order rather than with the date exactly 1 day earlier (here the rows are sorted newest first, so the next group is the previous date):

    s = df.groupby('Date', sort=False)['Name'].agg(set)
    df['Check'] = df['Date'].map(s.ne(s.shift(-1)).astype(int))
    

    Or, to compare with the previous available date (this relies on set difference, so it only detects names that were added):

    df['Check'] = df['Date'].map(df.groupby('Date', sort=True)['Name']
                                   .agg(set).diff().ne(set()).astype(int))
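
A set difference is asymmetric: if a name only disappears from one date to the next, the difference "current minus previous" is empty and the change goes unflagged. As a sketch of a symmetric alternative, assuming you want any change in composition to count, you can compare each set with the previous available one via a positional shift on the sorted groups. The data below are hypothetical and chosen so that 'B' is dropped on 2023-10-03:

```python
import pandas as pd

# Hypothetical data: 'B' disappears on 2023-10-03, nothing changes on 2023-10-04.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2023-10-04", "2023-10-03",
        "2023-10-02", "2023-10-02",
        "2023-10-01", "2023-10-01",
    ]),
    "Name": ["A", "A", "A", "B", "A", "C"],
})

# Plain set difference misses a pure removal: nothing was added, so it is empty.
added = {"A"} - {"A", "B"}

# One set of names per date, in chronological order.
s = df.groupby("Date", sort=True)["Name"].agg(set)

# Compare each set with the previous available date (positional shift).
# The first date has no predecessor and is therefore flagged as changed.
changed = s.ne(s.shift())

# Broadcast the per-date flag back to the rows.
df["Check"] = df["Date"].map(changed.astype(int))
print(df)
```

Here 2023-10-03 gets Check = 1 even though no new name appeared, and only 2023-10-04 gets 0.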