Search code examples
pythonpandasdataframegroup-by

What is the best way to filter groups by two lambda conditions and create a new column based on the conditions?


This is my DataFrame:

import pandas as pd

df = pd.DataFrame(
    {
        'a': ['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'z', 'z', 'z', 'p', 'p', 'p', 'p'],
        'b': [1, -1, 1, 1, -1, 1, 1, -1, -1, -1, -1, 1, 1, 1, 1]
    }
)

And this the expected output. I want to create column c:

    a  b    c
0   x  1    first
1   x -1    first
2   x  1    first
3   x  1    first
4   y -1    second
5   y  1    second
6   y  1    second
7   y -1    second
11  p  1    first
12  p  1    first
13  p  1    first
14  p  1    first

Groups are defined by column a. I want to filter df and choose groups that either their first b is 1 OR their second b is 1.

I did this by this code:

df1 = df.groupby('a').filter(lambda x: (x.b.iloc[0] == 1) | (x.b.iloc[1] == 1))

And for creating column c for df1, again groups should be defined by a and then if for each group first b is 1 then c is first and if the second b is 1 then c is second.

Note that for group p, both first and second b is 1, for these groups I want c to be first.

Maybe the way that I approach the issue is totally wrong.


Solution

  • A generic method that works with any number of positions for the first 1:

    d = {0: 'first', 1: 'second'}
    
    s = (df.groupby('a')['b']
           .transform(lambda g: g.reset_index()[g.values==1]
                      .first_valid_index())
           .replace(d)
         )
    
    out = df.assign(c=s).dropna(subset=['c'])
    

    Notes:

    • if you remove the replace step you will get an integer in c
    • if you use map in place of replace you can ignore the positions that are not defined as a dictionary key

    Output:

        a  b       c
    0   x  1   first
    1   x -1   first
    2   x  1   first
    3   x  1   first
    4   y -1  second
    5   y  1  second
    6   y  1  second
    7   y -1  second
    11  p  1   first
    12  p  1   first
    13  p  1   first
    14  p  1   first
    

    Example from comments:

    df = pd.DataFrame({'a': ['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'z', 'z', 'z', 'p', 'p', 'p', 'p'],
                      'b': [1, -1, 1, 1, -1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1]})
    
    d = {0: 'first', 1: 'second'}
    
    s = (df.groupby('a')['b']
           .transform(lambda g: g.reset_index()[g.values==1]
                      .first_valid_index())
           .map(d)
         )
    
    out = df.assign(c=s).dropna(subset=['c'])
    
        a  b       c
    0   x  1   first
    1   x -1   first
    2   x  1   first
    3   x  1   first
    4   y -1  second
    5   y  1  second
    6   y  1  second
    7   y -1  second
    11  p  1   first
    12  p  1   first
    13  p  1   first
    14  p  1   first
    

    You can also only filter the rows with:

    m1 = df.groupby('a').cumcount().le(1)
    m2 = df['b'].eq(1)
    out = df.loc[df['a'].isin(df.loc[m1&m2, 'a'])]