Search code examples
pythonpandascumsum

Cumulative Sum of Grouped Strings in Pandas


I have a pandas data frame that I want to group by two columns and then return the cumulative sum of a third column of strings as a list within one of these groups.

Example:

Year  Bucket  Name
2000    1      A
2001    1      B
2003    1      C
2000    2      B
2002    2      C

The output I want is:

Year  Bucket  Cum_Sum
2000    1      [A]
2001    1      [A,B]
2002    1      [A,B]
2003    1      [A,B,C]
2000    2      [B]
2001    2      [B]
2002    2      [B,C]
2003    2      [B,C]

I tried to piece together an answer from two responses: https://stackoverflow.com/a/39623235/5143841 https://stackoverflow.com/a/22651188/5143841

But I can't quite get there.


Solution

  • My Dr. Frankenstein Answer

    dat = []
    rng = range(df.Year.min(), df.Year.max() + 1)
    for b, d in df.groupby('Bucket'):
        for y in rng:
            dat.append([y, b, [*d.Name[d.Year <= y]]])
            
    pd.DataFrame(dat, columns=[*df])
    
       Year  Bucket       Name
    0  2000       1        [A]
    1  2001       1     [A, B]
    2  2002       1     [A, B]
    3  2003       1  [A, B, C]
    4  2000       2        [B]
    5  2001       2        [B]
    6  2002       2     [B, C]
    7  2003       2     [B, C]
    

    Another freaky answer

    rng = range(df.Year.min(), df.Year.max() + 1)
    i = [(y, b) for b, d in df.groupby('Bucket') for y in rng]
    s = df.set_index(['Year', 'Bucket']).Name.map(lambda x: [x])
    s.reindex(i, fill_value=[]).groupby(level=1).apply(pd.Series.cumsum).reset_index()
    
       Year  Bucket       Name
    0  2000       1        [A]
    1  2001       1     [A, B]
    2  2002       1     [A, B]
    3  2003       1  [A, B, C]
    4  2000       2        [B]
    5  2001       2        [B]
    6  2002       2     [B, C]
    7  2003       2     [B, C]