Newb here but hoping someone can help me with the code to help me break down a large dataframe. I need to do this over lots of rows (could be hundreds of thousands) so I thought to use Pandas to bring all the data into a dataframe. I'm trying to work out the logic with a smaller subset of data before I try it on the larger dataset where I will either use dask or Pandas with chunksize to bring in the larger dataset... needs to be as memory efficient as possible.
Lets say I have the following dataframe:
a b
0 10 random_data_that_I need
1 23 random_data_that_I_need
2 45 random_data_that_I_need
3 32 random_data_that_I_need
4 15 random_data_that_I_need
5 10 random_data_that_I_need
6 34 random_data_that_I_need
7 65 random_data_that_I_need
8 20 random_data_that_I_need
9 45 random_data_that_I_need
10 11 random_data_that_I_need
11 12 random_data_that_I_need
What I would like to do is to sum up the "a" column until a value is met, lets say my target threshold is 50. Once threshold is met I would like to have all the rows that got me there included as a subset. If the addition of the next row puts me over, its ok, because the previous summation of rows was under the '50' threshold, it should add the next row, but then restart the process. If I have any leftover rows at the end that don't get me to the threshold number, then sum those up.
so the end result would look like
result_df1:
0 10 random_data_that_I need
1 23 random_data_that_I need
2 45 random_data_that_I need
result_df2:
3 32 random_data_that_I need
4 15 random_data_that_I need
5 10 random_data_that_I need
result_df3:
6 34 random_data_that_I need
7 65 random_data_that_I need
result_df4:
8 20 random_data_that_I need
9 45 random_data_that_I need
result_df5:
10 11 random_data_that_I_need
11 12 random_data_that_I_need
The results don't have to be a dataframe... but might be nice if it was...
One way:
df_list = []
old_index = 0
while True:
m = df.iloc[old_index:, :].a.cumsum().sub(50).gt(0)
if any(m):
index = m.idxmax()
else:
break
df1 = df.iloc[old_index:index+1]
df_list.append(df1)
old_index = index + 1
df_list.append(df.iloc[index+1:, :])
[ a b
0 10 random_data_that_I_need
1 23 random_data_that_I_need
2 45 random_data_that_I_need,
a b
3 32 random_data_that_I_need
4 15 random_data_that_I_need
5 10 random_data_that_I_need,
a b
6 34 random_data_that_I_need
7 65 random_data_that_I_need,
a b
8 20 random_data_that_I_need
9 45 random_data_that_I_need,
a b
10 11 random_data_that_I_need
11 12 random_data_that_I_need]
sums = 0
df_list = []
old_index = 0
for index, i in enumerate(df.a):
sums += i
if sums > 50:
df_list.append(df[old_index:index+1])
old_index = index + 1
sums = 0
df_list.append(df[old_index:])