I have 2 dataframes with the following structures:
df1
Group1 Group2 Label
G1 A1 AA
G1 A1 BB
G1 A1 CC
G1 A2 AA
G1 A2 CC
G2 A1 BB
G2 A1 DD
G2 A2 AA
G2 A2 CC
G2 A2 DD
G2 A2 BB
df2
ID Label_ref
1 AA
2 BB
4 CC
5 DD
7 EE
I want to group the df1
based on the Group1
and Group2
columns and check if the 'Label' column contains values from df2
Label_ref
in order of ID
.
Label
on df1
doesn't need to have all values from Label_ref
on df2
, but the values Label
on df1
can't skip any Label_ref
values in the order of ID
Expected output:
The group Group1=G1
, Group2=A1
doesn't skip any values from AA
- CC
. Therefore the rows corresponding to the group are flagged.
The group Group1=G1
, Group2=A2
skips values from BB
but has the value CC
. Therefore the rows corresponding to the group are not flagged.
The group Group1=G2
, Group2=A2
doesn't skip any values from AA
- DD
although they are not in order. Therefore the rows corresponding to the group are flagged.
Group1 Group2 Label Flag
G1 A1 AA 1
G1 A1 BB 1
G1 A1 CC 1
G1 A2 AA 0
G1 A2 CC 0
G2 A1 BB 0
G2 A1 DD 0
G2 A2 AA 1
G2 A2 CC 1
G2 A2 DD 1
G2 A2 BB 1
I haven't been able to make much progress:
import pandas as pd
df1 = pd.DataFrame({
'Group1': [ 'G1','G1', 'G1','G1','G1',
'G2','G2', 'G2','G2','G2','G2'],
'Group2': ['A1','A1','A1','A2','A2',
'A1','A1','A2','A2','A2','A2'],
'Label': ['AA','BB','CC','AA','CC','BB',
'DD','AA','CC','DD','BB']})
df2 = pd.DataFrame({
'ID': [ 1, 2, 4, 5, 7],
'Label_ref': ['AA','BB','CC','DD','EE']})
A link to a solution or a function/method I can use to achieve this is appreciated
You can use a mapping series:
# Create mapping series
smap = pd.Series(*pd.factorize(df2.sort_values('ID')['Label_ref']))
df1['Flag'] = df1['Label'].map(smap)
# Check if flags are consecutive for each group
is_consecutive = lambda x: (~x.diff().gt(1).any()).astype(int)
df1['Flag'] = df1.sort_values('Flag').groupby(['Group1', 'Group2'])['Flag'].transform(is_consecutive)
Output:
>>> df1
Group1 Group2 Label Flag
0 G1 A1 AA 1
1 G1 A1 BB 1
2 G1 A1 CC 1
3 G1 A2 AA 0
4 G1 A2 CC 0
5 G2 A1 BB 0
6 G2 A1 DD 0
7 G2 A2 AA 1
8 G2 A2 CC 1
9 G2 A2 DD 1
10 G2 A2 BB 1