Search code examples
pythonpandasfor-loopvectorization

Sum over one level of multiindex in assignment to new column


I have a pandas Dataframe, whose columns are

cat1, cat2, cat3, cat4, value1, value2

I would like to add a new column value3 and assing it a value that depends on value1 and the sum of value1 over all values in cat4. Is it possible to do this in a vectorized/efficient way, or do I need to write a for loop?

For example if the values were

cat1, cat2, cat3, cat4, value1, value2
A,A,A,A, 3,1
A,A,A,B, 5,2
A,A,A,C, 22,4
A,A,X,A, 0,5
A,A,X,B, 4,5
A,A,X,C, 6,5

My desired endresult would be

cat1, cat2, cat3, cat4, value1, value2, value3
A,A,A,A, 3,1, 30
A,A,A,B, 5,2, 30
A,A,A,C, 22,4, 30
A,A,X,A, 0,5, 10
A,A,X,B, 4,5, 10
A,A,X,C, 6,5, 10

where value3 is the same sum over the value1 values with different cat4 entries when all entries of cat1, cat2, and cat3 are the same. Thanks!


Solution

  • IIUC, you can transform('sum') per groups of cat1/cat2/cat3:

    df['value3'] = (df.groupby(['cat1', 'cat2', 'cat3'])
                    ['value1'].transform('sum'))
    

    output:

      cat1 cat2 cat3 cat4  value1  value2  value3
    0    A    A    A    A       3       1      30
    1    A    A    A    B       5       2      30
    2    A    A    A    C      22       4      30
    3    A    A    X    A       0       5      10
    4    A    A    X    B       4       5      10
    5    A    A    X    C       6       5      10