
Python pandas groupby: how to use variables in different columns to create a new one


Consider the following DataFrame:

import pandas as pd

df = pd.DataFrame({'c0': ['1980']*3 + ['1990']*2 + ['2000']*3,
                   'c1': ['x','y','z'] + ['x','y'] + ['x','y','z'],
                   'c2': range(8)})

     c0 c1  c2
0  1980  x   0
1  1980  y   1
2  1980  z   2
3  1990  x   3
4  1990  y   4
5  2000  x   5
6  2000  y   6
7  2000  z   7

I want to do the following using pandas's groupby over c0:

  1. Group rows by c0 (which indicates the year).
  2. In each group, subtract the c2 value of the row where c1 is y from every c2 value.
  3. Add a new column c3 collecting those values.

The desired result is:

     c0 c1  c2  c3
0  1980  x   0  -1
1  1980  y   1   0
2  1980  z   2   1
3  1990  x   3  -1
4  1990  y   4   0
5  2000  x   5  -1
6  2000  y   6   0
7  2000  z   7   1

I was able to get the result without groupby like the following:

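dic = {}

for yr in df['c0'].unique():
    # Rows belonging to the current year
    cond1 = (df['c0'] == yr)
    tmp = df.loc[cond1, :].copy()

    # c2 value on the 'y' row of this year (a length-1 array)
    cond2 = (tmp['c1'] == 'y')
    val = tmp.loc[cond2, 'c2'].to_numpy()

    # Subtract the 'y' value from every c2 in the group
    tmp['c3'] = tmp['c2'] - val

    dic[yr] = tmp

# Stitch the per-year pieces back together in order
pd.concat([dic['1980'], dic['1990'], dic['2000']])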

It works but does not look great. I tried transform and apply for groupby, but could not figure it out. Any help would be appreciated.


Solution

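  • You can broadcast each group's y value to all of its rows with transform, after masking the non-y rows with where. The masked rows become NaN, so 'max' picks out the y value of each group (and c3 comes out as float):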

    df['c3'] = df['c2'] - df.where(df['c1'] == 'y').groupby(df['c0'])['c2'].transform('max')
    print(df)
    
    # Output
         c0 c1  c2   c3
    0  1980  x   0 -1.0
    1  1980  y   1  0.0
    2  1980  z   2  1.0
    3  1990  x   3 -1.0
    4  1990  y   4  0.0
    5  2000  x   5 -1.0
    6  2000  y   6  0.0
    7  2000  z   7  1.0
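
  • To see what the one-liner does, it may help to split it into its two steps. This is only a sketch, assuming every c0 group contains exactly one y row; the names masked and y_per_row are illustrative and not part of the original answer. The final astype(int) is an optional addition that restores the integer dtype shown in the question, and it would fail if some group had no y row (c3 would then contain NaN):

```python
import pandas as pd

df = pd.DataFrame({'c0': ['1980']*3 + ['1990']*2 + ['2000']*3,
                   'c1': ['x', 'y', 'z'] + ['x', 'y'] + ['x', 'y', 'z'],
                   'c2': range(8)})

# Step 1: keep c2 only on the 'y' rows; every other row becomes NaN
masked = df['c2'].where(df['c1'] == 'y')

# Step 2: within each c0 group, broadcast the single non-NaN value to all rows
# ('max' skips NaN, so it returns the y value of the group)
y_per_row = masked.groupby(df['c0']).transform('max')

# Subtract and cast back to int (safe only when no group lacks a y row)
df['c3'] = (df['c2'] - y_per_row).astype(int)
print(df)
```

    Here masked holds NaN everywhere except on the y rows, and y_per_row repeats each group's y value on every row of that group, which is what makes the row-wise subtraction line up.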