Search code examples
pandasrow

pandas select rows and cell values based on column and other conditions


How to do the following in pandas-

df = pd.DataFrame({
    'ColA': [1, 2, 3, 11111, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    'ColB': [11, 5, 22, 66, 4, 33333, 45, 91, 78, 10, 17, 55, 73, 85, 56, 99, 4, 74],
    'ColC': ['A', 'B', 'X', 'C', 'D', 'X', 'E', 'F', 'G', 'A', 'B', 'X', 'C', 'D', 'X', 'E', 'F', 'G'],
    'ColD': [0,0,0,0,0, 22222,0,0,0,0,0,48,0,0,52,0,0,0]
})

Disregard all rows before 1st and after last occurrence of X in ColC

search for 1st occurrence of X in ColC, select one row below, take value (say, Y) from ColA from such selected row. (in df it is 11111)

search for 2nd occurrence of X in ColC, take value (say, Z) from ColB from such row. (in df it is 33333)

write result value (Z-Y) in ColD next to 2nd occurrence of X. (should be 33333-11111 = 22222)

Repeat for pairs of 2nd and 3rd occurrence of X, 3rd and 4th, 4th and 5th and so on till the end of df.

Expected result in ColD.


Solution

  • Using some shift-fu and boolean indexing:

    m = df['ColC'].eq('X')
    
    df['ColD'] = df.loc[m, 'ColB'] - df['ColA'].shift(-1)[m].shift()
    

    Or, if you want 0s:

    m = df['ColC'].eq('X')
    
    df['ColD'] = (df['ColB'].sub(df['ColA'].shift(-1)[m].shift())
                    .fillna(0, downcast='infer')
                 )
    

    Output:

         ColA   ColB ColC   ColD
    0       1     11    A      0
    1       2      5    B      0
    2       3     22    X      0
    3   11111     66    C      0
    4       5      4    D      0
    5       6  33333    X  22222
    6       7     45    E      0
    7       8     91    F      0
    8       9     78    G      0
    9       1     10    A      0
    10      2     17    B      0
    11      3     55    X     48
    12      4     73    C      0
    13      5     85    D      0
    14      6     56    X     52
    15      7     99    E      0
    16      8      4    F      0
    17      9     74    G      0
    

    Intermediate:

    # get the value of the row following the previous X
    df['ColA'].shift(-1)[m].shift()
    
    2         NaN
    5     11111.0
    11        7.0
    14        4.0
    Name: ColA, dtype: float64