Tags: python, pandas, dataframe, cumsum

Pandas - Sum column of data until value is met, build subset, rinse and repeat for all rows


Newb here, but hoping someone can help me with the code to break down a large dataframe. I need to do this over lots of rows (could be hundreds of thousands), so I thought I'd use Pandas to bring all the data into a dataframe. I'm working out the logic on a smaller subset of data before I try it on the larger dataset, where I'll either use dask or Pandas with chunksize to bring the data in... it needs to be as memory efficient as possible.
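
For the chunked read I have in mind, something along these lines (a rough sketch, assuming the data comes from a CSV; the filename, chunk size, and the process() helper below are just placeholders):

    import pandas as pd

    # Read the large file in pieces instead of all at once; each chunk is a
    # regular DataFrame of up to `chunksize` rows.
    for chunk in pd.read_csv("large_file.csv", chunksize=100_000):
        process(chunk)  # placeholder for whatever per-chunk logic I end up with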

Let's say I have the following dataframe:

   a   b  
0  10  random_data_that_I_need
1  23  random_data_that_I_need
2  45  random_data_that_I_need
3  32  random_data_that_I_need
4  15  random_data_that_I_need
5  10  random_data_that_I_need
6  34  random_data_that_I_need
7  65  random_data_that_I_need
8  20  random_data_that_I_need
9  45  random_data_that_I_need
10 11  random_data_that_I_need
11 12  random_data_that_I_need

What I would like to do is sum up the "a" column until a target value is met; let's say my threshold is 50. Once the threshold is met, I would like all the rows that got me there included as a subset. If the addition of the next row puts me over, that's OK, because the previous sum of rows was under the 50 threshold; it should add that next row and then restart the process. If I have any leftover rows at the end that don't get me to the threshold, then sum those up into their own subset.

So the end result would look like:

result_df1:
0  10  random_data_that_I_need
1  23  random_data_that_I_need
2  45  random_data_that_I_need

result_df2:
3  32  random_data_that_I_need
4  15  random_data_that_I_need
5  10  random_data_that_I_need

result_df3:
6  34  random_data_that_I_need
7  65  random_data_that_I_need

result_df4:
8  20  random_data_that_I_need
9  45  random_data_that_I_need

result_df5:
10 11  random_data_that_I_need
11 12  random_data_that_I_need

The results don't have to be dataframes... but it might be nice if they were...


Solution

  • One way:

    df_list = []
    old_index = 0
    while True:
        # Mask is True from the first row where the running sum of "a"
        # (starting at old_index) exceeds 50.
        m = df.iloc[old_index:, :].a.cumsum().sub(50).gt(0)
        if any(m):
            # idxmax() gives the label of the first True; labels match iloc
            # positions here because the frame uses the default RangeIndex.
            index = m.idxmax()
        else:
            break
        df1 = df.iloc[old_index:index+1]
        df_list.append(df1)
        old_index = index + 1
    
    # Leftover rows that never reached the threshold; using old_index instead
    # of index+1 also avoids a NameError if the loop breaks on the first pass.
    df_list.append(df.iloc[old_index:, :])
    
    OUTPUT:
    [    a                        b
     0  10  random_data_that_I_need
     1  23  random_data_that_I_need
     2  45  random_data_that_I_need,
         a                        b
     3  32  random_data_that_I_need
     4  15  random_data_that_I_need
     5  10  random_data_that_I_need,
         a                        b
     6  34  random_data_that_I_need
     7  65  random_data_that_I_need,
         a                        b
     8  20  random_data_that_I_need
     9  45  random_data_that_I_need,
          a                        b
     10  11  random_data_that_I_need
     11  12  random_data_that_I_need]
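    
    As a quick check of what the mask computes on the first pass (run against
    the sample frame above, assuming it is named df):
    
    first = df.a.cumsum().sub(50).gt(0)
    # cumsum: 10, 33, 78, ... -> minus 50: -40, -17, 28, ... -> mask: F, F, T, ...
    first.idxmax()  # 2 -> first True, so the first chunk is rows 0..2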
    
    Alternative:
    
    sums = 0
    df_list = []
    old_index = 0
    for index, i in enumerate(df.a):
        sums += i
        if sums > 50:  # threshold crossed: close out the current chunk
            df_list.append(df[old_index:index+1])
            old_index = index + 1
            sums = 0
    if old_index < len(df):  # leftover rows that never reached the threshold
        df_list.append(df[old_index:])
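    
    If memory is the main concern, a generator variant of the same loop avoids
    holding every sub-frame in a list at once; threshold_chunks below is just an
    illustrative name, not part of pandas:
    
    def threshold_chunks(frame, threshold=50, col="a"):
        """Yield consecutive row blocks whose running sum of `col` exceeds `threshold`."""
        start = 0
        running = 0
        for pos, value in enumerate(frame[col]):
            running += value
            if running > threshold:
                yield frame.iloc[start:pos + 1]
                start = pos + 1
                running = 0
        if start < len(frame):  # leftover rows that never reached the threshold
            yield frame.iloc[start:]
    
    # Usage: iterate lazily instead of materialising a list of sub-frames.
    for chunk in threshold_chunks(df, threshold=50):
        print(chunk, end="\n\n")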