My dataframe looks like this:
A      B     C      D    E    F     G    H    I      J
FP002  12    FP001  113  406  519   85   82   FP001  6240
FP003  7610  FP002  99   552  651   49   64   FP002  12294
FP005  12,   FP003  102  131  1416  24   89   FP003  761
FP005  1250  FP004  94   739  833   122  215  FP004  400
I want my output to be like this:
A      B     C      D    E    F     G    H   I      J
FP002  12    FP002  99   552  651   49   64  FP002  12294
FP003  7610  FP003  102  131  1416  24   89  FP003  761
FP005  12,
FP005  1250
So basically, I want to realign the other blocks so that their rows match the values in Column A.
My code to start is this:
dfR = df1.join(df1, on=['A', 'C'], how='inner')
but it's not giving me the result I need.
You can split the chunks using groupby on axis=1, then join with functools.reduce after deduplicating with groupby.cumcount:
from functools import reduce

# start groups based on the first row, whenever a value starts with "FP"
group = df.iloc[0].astype(str).str.startswith('FP').cumsum()
# [1, 1, 2, 2, 2, 2, 2, 2, 3, 3]

out = reduce(lambda a, b: a.join(b, how='left'),
             # index each chunk by (first-column value, occurrence number) so that
             # duplicated keys like FP005 stay distinct in the join
             (d.set_index([d.iloc[:, 0].to_numpy(),
                           d.groupby(d.iloc[:, 0]).cumcount().to_numpy()])
              for k, d in df.groupby(group, axis=1))
             )
Output:
              A     B      C      D      E       F     G     H      I        J
FP002 0   FP002    12  FP002   99.0  552.0   651.0  49.0  64.0  FP002  12294.0
FP003 0   FP003  7610  FP003  102.0  131.0  1416.0  24.0  89.0  FP003    761.0
FP005 0   FP005   12,    NaN    NaN    NaN     NaN   NaN   NaN    NaN      NaN
      1   FP005  1250    NaN    NaN    NaN     NaN   NaN   NaN    NaN      NaN
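The groupby.cumcount step is what keeps the two FP005 rows distinct: it numbers repeated keys 0, 1, 2, ..., so each chunk is joined on a (key, occurrence) pair rather than on the key alone. A quick, self-contained illustration on the first chunk's key column (values copied from the sample data):

import pandas as pd

keys = pd.Series(['FP002', 'FP003', 'FP005', 'FP005'])
# occurrence number of each key within its group of duplicates
print(keys.groupby(keys).cumcount().tolist())   # [0, 0, 0, 1]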
Note that groupby on axis=1 will be deprecated in the future; the correct approach will then be:
from functools import reduce

group = df.iloc[0].astype(str).str.startswith('FP').cumsum()

out = reduce(lambda a, b: a.join(b, how='left'),
             # group the rows of df.T (i.e. the original columns), then transpose
             # each chunk back before building its (key, occurrence) index
             ((d2 := d.T).set_index([d.iloc[0].to_numpy(),
                                     d2.groupby(d.iloc[0]).cumcount().to_numpy()])
              for k, d in df.T.groupby(group))
             )
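The d2 := d.T walrus assignment needs Python 3.8+. If that is a problem, the same thing can be spelled with a small helper function (to_chunk is just an illustrative name):

def to_chunk(d):
    # d has the original columns as rows; transpose it back, then index it by
    # (first column's value, occurrence number) to keep duplicate keys distinct
    d2 = d.T
    return d2.set_index([d.iloc[0].to_numpy(),
                         d2.groupby(d.iloc[0]).cumcount().to_numpy()])

out = reduce(lambda a, b: a.join(b, how='left'),
             (to_chunk(d) for k, d in df.T.groupby(group)))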
The intermediate list that is passed to reduce to perform the left joins:
[             A     B
 FP002 0  FP002    12
 FP003 0  FP003  7610
 FP005 0  FP005   12,
       1  FP005  1250,
              C    D    E     F    G    H
 FP001 0  FP001  113  406   519   85   82
 FP002 0  FP002   99  552   651   49   64
 FP003 0  FP003  102  131  1416   24   89
 FP004 0  FP004   94  739   833  122  215,
              I      J
 FP001 0  FP001   6240
 FP002 0  FP002  12294
 FP003 0  FP003    761
 FP004 0  FP004    400]
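For intuition, reduce just chains left joins over this list; if you collect the three frames above in a list (chunks is an illustrative name), the call amounts to:

# same chunking as above, collected in a list instead of a generator
chunks = [d.set_index([d.iloc[:, 0].to_numpy(),
                       d.groupby(d.iloc[:, 0]).cumcount().to_numpy()])
          for k, d in df.groupby(group, axis=1)]
# reduce(lambda a, b: a.join(b, how='left'), chunks) unrolls to:
out = chunks[0].join(chunks[1], how='left').join(chunks[2], how='left')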
NB: you can also use other logic to form the groups, for instance starting a group whenever the data is not numeric:
group = pd.to_numeric(df.iloc[0], errors='coerce').isna().cumsum()
Or even fully manually:
group = [1, 1, 2, 2, 2, 2, 2, 2, 3, 3]
Or with an explicit list of column names:
group = df.columns.isin(['A', 'C', 'I']).cumsum()
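If you want to check any of these variants end to end, here is a minimal reproduction sketch; the DataFrame literal is my reconstruction of the sample data from the question (including the literal '12,' value), so adjust the dtypes to whatever your real data uses:

from functools import reduce
import pandas as pd

df = pd.DataFrame(
    [['FP002',    12, 'FP001', 113, 406,  519,  85,  82, 'FP001',  6240],
     ['FP003',  7610, 'FP002',  99, 552,  651,  49,  64, 'FP002', 12294],
     ['FP005', '12,', 'FP003', 102, 131, 1416,  24,  89, 'FP003',   761],
     ['FP005',  1250, 'FP004',  94, 739,  833, 122, 215, 'FP004',   400]],
    columns=list('ABCDEFGHIJ'))

# start a new chunk whenever the first row's value starts with "FP"
group = df.iloc[0].astype(str).str.startswith('FP').cumsum()

# transpose-based variant (avoids the deprecated groupby(..., axis=1))
out = reduce(lambda a, b: a.join(b, how='left'),
             ((d2 := d.T).set_index([d.iloc[0].to_numpy(),
                                     d2.groupby(d.iloc[0]).cumcount().to_numpy()])
              for k, d in df.T.groupby(group)))
print(out)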