Search code examples
pythonpandasconditional-statementscumsum

How to group based on cumulative sum that resets on a condition


I have a pandas df with word counts corresponding to articles. I want to be able to be able to add another column MERGED that is based on groups of articles that have a minimum cumulative sum of 'min_words'.

df = pd.DataFrame([[  0,  6],
       [  1,  10],
       [  3,   5],
       [  4,   7],
       [  5,  26],
       [  6,   7],
       [  9,   4],
       [ 10, 133],
       [ 11,  42],
       [ 12,   1]], columns=['ARTICLE', 'WORD_COUNT'])

df
Out[15]: 
   ARTICLE  WORD_COUNT
0        0           6
1        1          10
2        3           5
3        4           7
4        5          26
5        6           7
6        9           4
7       10         133
8       11          42
9       12           1

So then if min_words = 20 this is the desired output:

    df
Out[17]: 
   ARTICLE  WORD_COUNT  MERGED
0        0           6       0
1        1          10       0
2        3           5       0
3        4           7       1
4        5          26       1
5        6           7       2
6        9           4       2
7       10         133       2
8       11          42       3
9       12           1       4

As seen above, it is possible that the final article(s) won't satisfy the min_words condition, and that's ok.


Solution

  • We can only do self def function

    def dymcumsum(v, limit):
         idx = []
         sums = 0
         for i in range(len(v)):
             sums += v[i]
             if sums >= limit:
                 idx.append(i)
                 sums = 0
         return(idx)
    df['New']=np.nan
    df.loc[dymcumsum(df.WORD_COUNT,20),'New']=1
    df.New=df.New.iloc[::-1].eq(1).cumsum()[::-1].factorize()[0]+1
     
    df
       ARTICLE  WORD_COUNT  New
    0        0           6    1
    1        1          10    1
    2        3           5    1
    3        4           7    2
    4        5          26    2
    5        6           7    3
    6        9           4    3
    7       10         133    3
    8       11          42    4
    9       12           1    5