Search code examples
pythonpandasgroup-by

Python group by columns and make sure values in group doesn't skip values in order of another dataframe


I have 2 dataframes with the following structures:

df1
Group1 Group2 Label
G1     A1    AA
G1     A1    BB
G1     A1    CC
G1     A2    AA
G1     A2    CC
G2     A1    BB
G2     A1    DD
G2     A2    AA
G2     A2    CC
G2     A2    DD
G2     A2    BB

df2
ID  Label_ref
1     AA
2     BB
4     CC
5     DD
7     EE

I want to group the df1 based on the Group1 and Group2 columns and check if the 'Label' column contains values from df2 Label_ref in order of ID.

Label on df1 doesn't need to have all values from Label_ref on df2, but the values Label on df1 can't skip any Label_ref values in the order of ID

Expected output:

The group Group1=G1, Group2=A1 doesn't skip any values from AA - CC. Therefore the rows corresponding to the group are flagged.

The group Group1=G1, Group2=A2 skips values from BB but has the value CC. Therefore the rows corresponding to the group are not flagged.

The group Group1=G2, Group2=A2 doesn't skip any values from AA - DD although they are not in order. Therefore the rows corresponding to the group are flagged.

Group1 Group2 Label  Flag
G1     A1    AA     1
G1     A1    BB     1
G1     A1    CC     1
G1     A2    AA     0
G1     A2    CC     0
G2     A1    BB     0
G2     A1    DD     0
G2     A2    AA     1
G2     A2    CC     1
G2     A2    DD     1
G2     A2    BB     1

I haven't been able to make much progress:

import pandas as pd
df1 = pd.DataFrame({
                'Group1': [ 'G1','G1', 'G1','G1','G1',
                            'G2','G2', 'G2','G2','G2','G2'],
                'Group2': ['A1','A1','A1','A2','A2',
                            'A1','A1','A2','A2','A2','A2'],
                'Label': ['AA','BB','CC','AA','CC','BB',
                            'DD','AA','CC','DD','BB']})
df2 = pd.DataFrame({
                'ID': [ 1, 2, 4, 5, 7],
                'Label_ref': ['AA','BB','CC','DD','EE']})

A link to a solution or a function/method I can use to achieve this is appreciated


Solution

  • You can use a mapping series:

    # Create mapping series
    smap = pd.Series(*pd.factorize(df2.sort_values('ID')['Label_ref']))
    df1['Flag'] = df1['Label'].map(smap)
    
    # Check if flags are consecutive for each group
    is_consecutive = lambda x: (~x.diff().gt(1).any()).astype(int)
    df1['Flag'] = df1.sort_values('Flag').groupby(['Group1', 'Group2'])['Flag'].transform(is_consecutive)
    

    Output:

    >>> df1
       Group1 Group2 Label  Flag
    0      G1     A1    AA     1
    1      G1     A1    BB     1
    2      G1     A1    CC     1
    3      G1     A2    AA     0
    4      G1     A2    CC     0
    5      G2     A1    BB     0
    6      G2     A1    DD     0
    7      G2     A2    AA     1
    8      G2     A2    CC     1
    9      G2     A2    DD     1
    10     G2     A2    BB     1