Tags: python, pandas, join, merge

Deleting the rows in a set of columns that do not match the values in another column (all within one dataframe)


My dataframe looks like this:

A        B     C      D    E    F     G    H    I      J
FP002    12    FP001  113  406  519   85   82   FP001  6240
FP003    7610  FP002  99   552  651   49   64   FP002  12294
FP005    12,   FP003  102  131  1416  24   89   FP003  761
FP005    1250  FP004  94   739  833   122  215  FP004  400

I want my output to be like this:

A        B     C      D    E    F     G    H    I      J
FP002    12    FP002  99   552  651   49   64   FP002  12294
FP003    7610  FP003  102  131  1416  24   89   FP003  761
FP005    12,
FP005    1250

So basically, the rows in columns C–H and I–J should be kept only where their key matches a value in column A, aligned row by row against column A.

My code to start is this:

dfR = df1.join(df1, on=['A', 'C'], how='inner')

but it's not giving me the result I need.


Solution

  • You can split the dataframe into chunks using groupby on axis=1, then join them with functools.reduce after deduplicating the keys with groupby.cumcount:

    from functools import reduce
    
    # start groups based on first row whenever a value starts with "FP"
    group = df.iloc[0].astype(str).str.startswith('FP').cumsum()
    # [1, 1, 2, 2, 2, 2, 2, 2, 3, 3]
    
    out = reduce(lambda a,b: a.join(b, how='left'), 
                 (d.set_index([d.iloc[:, 0].to_numpy(),
                               d.groupby(d.iloc[:, 0]).cumcount().to_numpy()])
                  for k, d in df.groupby(group, axis=1))
                )
    

    Output:

                 A     B      C      D      E       F     G     H      I        J
    FP002 0  FP002    12  FP002   99.0  552.0   651.0  49.0  64.0  FP002  12294.0
    FP003 0  FP003  7610  FP003  102.0  131.0  1416.0  24.0  89.0  FP003    761.0
    FP005 0  FP005   12,    NaN    NaN    NaN     NaN   NaN   NaN    NaN      NaN
          1  FP005  1250    NaN    NaN    NaN     NaN   NaN   NaN    NaN      NaN
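If the (key, cumcount) alignment index is not wanted in the final result, it can simply be dropped afterwards. A minimal sketch on a made-up two-column frame carrying the same kind of MultiIndex as the output above:

```python
import pandas as pd

# made-up frame carrying the (key, cumcount) MultiIndex produced by the join
out = pd.DataFrame(
    {'A': ['FP002', 'FP005', 'FP005'], 'B': [12, '12,', 1250]},
    index=pd.MultiIndex.from_tuples([('FP002', 0), ('FP005', 0), ('FP005', 1)]),
)

# the index only served to align the chunks, so drop it for a flat frame
out_flat = out.reset_index(drop=True)
print(out_flat.index.tolist())  # [0, 1, 2]
```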
    

    Note that groupby with axis=1 is deprecated in recent pandas versions, so the future-proof approach is to transpose first:

    from functools import reduce
    
    group = df.iloc[0].astype(str).str.startswith('FP').cumsum()
    
    out = reduce(lambda a,b: a.join(b, how='left'), 
                 ((d2:=d.T).set_index([d.iloc[0].to_numpy(),
                                       d2.groupby(d.iloc[0]).cumcount().to_numpy()])
                  for k, d in df.T.groupby(group))
                )
    

    Intermediate that is passed to reduce to perform the left join:

    [             A     B
     FP002 0  FP002    12
     FP003 0  FP003  7610
     FP005 0  FP005   12,
           1  FP005  1250,
                  C    D    E     F    G    H
     FP001 0  FP001  113  406   519   85   82
     FP002 0  FP002   99  552   651   49   64
     FP003 0  FP003  102  131  1416   24   89
     FP004 0  FP004   94  739   833  122  215,
                  I      J
     FP001 0  FP001   6240
     FP002 0  FP002  12294
     FP003 0  FP003    761
     FP004 0  FP004    400]
    

    NB: you can also use other logic to form the groups, for instance starting a new group whenever the value in the first row is not numeric:

    group = pd.to_numeric(df.iloc[0], errors='coerce').isna().cumsum()
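    For instance, on the first row of the frame above (values copied from the post), this yields the same grouping. Be aware that this variant keys off numeric parsing, so a malformed number like '12,' appearing in the first row would also coerce to NaN and start a spurious group:

```python
import pandas as pd

# first row of the question's frame (illustrative values)
row0 = pd.Series(['FP002', 12, 'FP001', 113, 406, 519, 85, 82, 'FP001', 6240],
                 index=list('ABCDEFGHIJ'))

# non-numeric entries coerce to NaN, and each NaN starts a new group
group = pd.to_numeric(row0, errors='coerce').isna().cumsum()
print(group.tolist())  # [1, 1, 2, 2, 2, 2, 2, 2, 3, 3]
```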
    

    Or even fully manually:

    group = [1, 1, 2, 2, 2, 2, 2, 2, 3, 3]
    

    Or from an explicit list of the key columns:

    group = df.columns.isin(['A', 'C', 'I']).cumsum()
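    Putting it all together, here is a self-contained sketch of the future-proof (transposed) variant, run on a reconstruction of the question's dataframe (values copied from the post, so treat them as illustrative):

```python
from functools import reduce

import pandas as pd

# reconstruction of the question's frame (illustrative values)
df = pd.DataFrame({
    'A': ['FP002', 'FP003', 'FP005', 'FP005'],
    'B': [12, 7610, '12,', 1250],
    'C': ['FP001', 'FP002', 'FP003', 'FP004'],
    'D': [113, 99, 102, 94],
    'E': [406, 552, 131, 739],
    'F': [519, 651, 1416, 833],
    'G': [85, 49, 24, 122],
    'H': [82, 64, 89, 215],
    'I': ['FP001', 'FP002', 'FP003', 'FP004'],
    'J': [6240, 12294, 761, 400],
})

# one group per key column (A, C, I) and its trailing value columns
group = df.iloc[0].astype(str).str.startswith('FP').cumsum()

# transpose, split into chunks, re-index each chunk on (key, cumcount),
# then left-join every chunk onto the first one
out = reduce(
    lambda a, b: a.join(b, how='left'),
    ((d2 := d.T).set_index([d.iloc[0].to_numpy(),
                            d2.groupby(d.iloc[0]).cumcount().to_numpy()])
     for k, d in df.T.groupby(group)),
)
```

    The FP005 rows survive from the A/B chunk with NaN in C–J, since no matching key exists in the other chunks.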