I have a DataFrame with columns 'a' and 'b'. I want to create a new column 'c' that holds the result of this SQL statement:
SELECT a, COUNT(CASE WHEN b = 1 THEN a ELSE NULL END) FROM df GROUP BY a
How can I achieve the same thing using Pandas?
df = pd.DataFrame({'a': ['a', 'a', 'b', 'a', 'b'], 'b': [1, 0, 0, 1, 1]})
df
   a  b
0  a  1
1  a  0
2  b  0
3  a  1
4  b  1
I tried the following:
df['c'] = df.groupby('a').apply(lambda x : x[x['b']==1]['a'].count()).reset_index().ix[:,-1]
which generated these incorrect results:
a b c
0 a 1 2
1 a 0 0
2 b 0 NaN
3 a 1 NaN
4 b 1 NaN
I was expecting [2, NaN, NaN, 2, 1].
Appreciate all the help and guidance in advance!
I would create a dummy column to do this:
In [11]: df["c"] = df["b"] == 1
In [12]: df["c"]
Out[12]:
0 True
1 False
2 False
3 True
4 True
Name: c, dtype: bool
In [13]: df.groupby("a")["c"].sum()
Out[13]:
a
a 2
b 1
Name: c, dtype: int64
This avoids apply, so it will be more efficient.
Note: checking equality and then summing the booleans is the same as counting the entries that are equal, because True counts as 1 and False as 0.
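As a rough illustration of that note, here is a minimal sketch using the data from the question. The variable names (mask, count_via_sum, count_via_filter, per_group) are my own, chosen only for this example:

```python
import pandas as pd

df = pd.DataFrame({"a": ["a", "a", "b", "a", "b"], "b": [1, 0, 0, 1, 1]})

# Boolean mask marking the rows where b equals 1
mask = df["b"] == 1

# Summing the mask counts its True entries
count_via_sum = int(mask.sum())

# Filtering the rows and counting them gives the same number
count_via_filter = len(df[mask])

# The same idea per group: sum the mask within each value of 'a'
per_group = mask.groupby(df["a"]).sum()
```

Both counts agree, and per_group holds one count for each distinct value of 'a'.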
If you want the per-group count broadcast back to every row of df, you can use transform:
In [14]: df.groupby("a")["c"].transform("sum")
Out[14]:
0 2
1 2
2 1
3 2
4 1
Name: c, dtype: int64
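Note that transform fills every row of a group with that group's count, whereas the question expects NaN on the rows where b is not 1. One possible way to get that shape, as a sketch rather than the only approach, is to mask the broadcast counts with Series.where. The names is_one and counts are assumptions made up for this example, and the explicit astype(int) is there only so the sum is clearly an integer count:

```python
import pandas as pd

df = pd.DataFrame({"a": ["a", "a", "b", "a", "b"], "b": [1, 0, 0, 1, 1]})

# True on the rows where b equals 1
is_one = df["b"] == 1

# Per-group count of rows with b == 1, broadcast back to every row
counts = is_one.astype(int).groupby(df["a"]).transform("sum")

# Keep the count only where b == 1; all other rows become NaN
df["c"] = counts.where(is_one)
```

Because of the NaN entries, the resulting column 'c' is stored as floats rather than integers.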