pandas, lambda, group-by, transform, calculated-columns

Creating a new column based on group by and logical filtering


I have a DataFrame with columns 'a' and 'b'. I want to create a new column which is the result of this SQL statement.

df['c'] = SELECT a, COUNT(CASE WHEN b = 1 THEN a ELSE NULL END) FROM df GROUP BY a

How can I achieve the same thing using Pandas?

import pandas as pd
df = pd.DataFrame({'a': ['a', 'a', 'b', 'a', 'b'], 'b': [1, 0, 0, 1, 1]})
df =    a   b
0       a   1
1       a   0
2       b   0
3       a   1
4       b   1

I tried the following:

df['c'] = df.groupby('a').apply(lambda x : x[x['b']==1]['a'].count()).reset_index().ix[:,-1]

which generated these incorrect results:

    a   b   c
0   a   1   2
1   a   0   0
2   b   0   NaN
3   a   1   NaN
4   b   1   NaN

I was expecting [2, NaN, NaN, 2, 1] instead.

Appreciate all the help and guidance in advance!


Solution

  • I would create a dummy column to do this:

    In [11]: df["c"] = df["b"] == 1
    
    In [12]: df["c"]
    Out[12]:
    0     True
    1    False
    2    False
    3     True
    4     True
    Name: c, dtype: bool
    
    In [13]: df.groupby("a")["c"].sum()
    Out[13]:
    a
    a    2
    b    1
    Name: c, dtype: int64
    

    This avoids apply, so it should be more efficient.

    Note: checking equality and summing the booleans is the same as counting the number of entries which are equal.

    If you want the per-group count to propagate back to every row of df, you can use transform:

    In [14]: df.groupby("a")["c"].transform("sum")
    Out[14]:
    0    2
    1    2
    2    1
    3    2
    4    1
    Name: c, dtype: int64
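
The transform result fills every row of a group, while the question expected NaN on rows where b is not 1. A minimal sketch of one way to get that shape, assuming the sample data from the question: compute the broadcast count with transform, then blank out the non-matching rows with `Series.where`. The names `mask` and `per_group_count` are illustrative only and do not appear in the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"a": ["a", "a", "b", "a", "b"], "b": [1, 0, 0, 1, 1]})

# Boolean mask of the rows that satisfy the filter (b == 1).
mask = df["b"].eq(1)

# Summing the mask within each group of 'a' counts the matching rows;
# transform broadcasts that count back onto every row of the group.
per_group_count = mask.groupby(df["a"]).transform("sum")

# Keep the count only where the filter holds; all other rows become NaN.
df["c"] = per_group_count.where(mask)

print(df)
```

Because NaN forces a float column, c comes out as 2.0, NaN, NaN, 2.0, 1.0 rather than as integers. If the count is wanted on every row, assigning `per_group_count` directly is enough.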