I have a dataframe with one column containing numbers (quantity). Every row represents one day, so the whole dataframe should be treated as sequential data. I want to add a second column that calculates the cumulative sum of the quantity column, but if at any point the cumulative sum is greater than 0, the next row should start counting the cumulative sum from 0 again.
I solved this problem using iterrows(), but I read that this function is very inefficient, and with millions of rows the calculation takes over 20 minutes. My solution below:
import pandas as pd
df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1], columns=['quantity'])
for index, row in df.iterrows():
    if index == 0:
        df.loc[index, 'outcome'] = df.loc[index, 'quantity']
    else:
        previous_outcome = df.loc[index-1, 'outcome']
        if previous_outcome > 0:
            previous_outcome = 0
        df.loc[index, 'outcome'] = previous_outcome + df.loc[index, 'quantity']
print(df)
# quantity outcome
# -1 -1.0
# -1 -2.0
# -1 -3.0
# -1 -4.0
# 15 11.0 <- since this is greater than 0, next line will start counting from 0
# -1 -1.0
# -1 -2.0
# -1 -3.0
# -1 -4.0
# 5 1.0 <- since this is greater than 0, next line will start counting from 0
# -1 -1.0
# 15 14.0 <- since this is greater than 0, next line will start counting from 0
# -1 -1.0
# -1 -2.0
# -1 -3.0
Is there a faster (more optimized) way to calculate this?
I'm also not sure if the "if index == 0" block is the best solution and whether this can be solved in a more elegant way. Without this block there is an error, since for the first row there is no "previous row" to use in the calculation.
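For reference, one way to drop the special case, while keeping the original iterrows() logic, is to fall back to 0 when there is no previous row (a minimal sketch; it still has the same performance problem as the loop above):
import pandas as pd

df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1], columns=['quantity'])
for index, row in df.iterrows():
    # Use 0 as the "previous outcome" for the first row; otherwise read it
    # from the previous row and reset it to 0 if it was positive.
    previous_outcome = df.loc[index - 1, 'outcome'] if index > 0 else 0
    if previous_outcome > 0:
        previous_outcome = 0
    df.loc[index, 'outcome'] = previous_outcome + df.loc[index, 'quantity']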
Iterating over DataFrame rows is very slow and should be avoided. Working with chunks of data is the way to go with pandas.
For your case, looking at your DataFrame column quantity as a numpy array, the code below should speed up the process quite a lot compared to your approach:
import pandas as pd
import numpy as np
df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1], columns=['quantity'])
x = np.array(df.quantity)
y = np.zeros(x.size)
total = 0
for i, xi in enumerate(x):
    total += xi
    y[i] = total
    total = total if total < 0 else 0
df['outcome'] = y
print(df)
Out :
quantity outcome
0 -1 -1.0
1 -1 -2.0
2 -1 -3.0
3 -1 -4.0
4 15 11.0
5 -1 -1.0
6 -1 -2.0
7 -1 -3.0
8 -1 -4.0
9 5 1.0
10 -1 -1.0
11 15 14.0
12 -1 -1.0
13 -1 -2.0
14 -1 -3.0
If you still need more speed, I suggest having a look at numba, as per jezrael's answer.
I got curious about performance and wrote the module below with all 3 approaches.
I haven't optimised the individual functions, just copied the code from the OP and jezrael's answer with minor changes.
"""
bench_dataframe.py
Performance test of iteration over DataFrame rows.
Methods tested are `DataFrame.iterrows()`, loop over `numpy.array`,
and same using `numba`.
"""
from numba import njit
import pandas as pd
import numpy as np
def pditerrows(df):
    """Iterate over DataFrame using `iterrows`"""
    for index, row in df.iterrows():
        if index == 0:
            df.loc[index, 'outcome'] = df.loc[index, 'quantity']
        else:
            previous_outcome = df.loc[index-1, 'outcome']
            if previous_outcome > 0:
                previous_outcome = 0
            df.loc[index, 'outcome'] = previous_outcome + df.loc[index, 'quantity']
    return df
def nparray(df):
    """Convert DataFrame column to `numpy` arrays."""
    x = np.array(df.quantity)
    y = np.zeros(x.size)
    total = 0
    for i, xi in enumerate(x):
        total += xi
        y[i] = total
        total = total if total < 0 else 0
    df['outcome'] = y
    return df
@njit
def f(x, lim):
    result = np.empty(len(x))
    result[0] = x[0]
    for i, j in enumerate(x[1:], 1):
        previous_outcome = result[i-1]
        if previous_outcome > lim:
            previous_outcome = 0
        result[i] = previous_outcome + x[i]
    return result
def numbaloop(df):
    """Convert DataFrame to `numpy` arrays and loop using `numba`.
    See [https://stackoverflow.com/a/69750009/5069105]
    """
    df['outcome'] = f(df.quantity.to_numpy(), 0)
    return df
def create_df(size):
    """Create a DataFrame filled with -1's and 15's, with 90% of
    the entries equal to -1 and 10% equal to 15, randomly
    placed in the array.
    """
    df = pd.DataFrame(
        np.random.choice(
            (-1, 15),
            size=size,
            p=[0.9, 0.1]
        ),
        columns=['quantity'])
    return df
# Make sure all tests lead to the same result
df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1],
                  columns=['quantity'])
assert nparray(df.copy()).equals(pditerrows(df.copy()))
assert nparray(df.copy()).equals(numbaloop(df.copy()))
Running for a somewhat small array, size = 20_000, leads to:
In: import bench_dataframe as bd
.. df = bd.create_df(size=20_000)
In: %timeit bd.pditerrows(df.copy())
7.06 s ± 224 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In: %timeit bd.nparray(df.copy())
9.76 ms ± 710 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In: %timeit bd.numbaloop(df.copy())
437 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Here numpy arrays were 700+ times faster than iterrows(), and numba was still 22 times faster than numpy.
And for larger arrays, size = 200_000, we get:
In: import bench_dataframe as bd
.. df = bd.create_df(size=200_000)
In: %timeit bd.pditerrows(df.copy())
I gave up and hit Ctrl+C after 10 minutes or so... =P
In: %timeit bd.nparray(df.copy())
86 ms ± 2.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In: %timeit bd.numbaloop(df.copy())
3.15 ms ± 66.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Making numba again 25+ times faster than numpy arrays for this example, and confirming that you should avoid using iterrows() at all costs for anything more than a couple of hundred rows.
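As a rough usage sketch (assuming the benchmark module above and a quantity column as in the question), the numba version applied to a few million rows should finish in well under a second, compared to the 20+ minutes reported for iterrows():
import bench_dataframe as bd

# Hypothetical example: a few million rows, 90% of them -1 and 10% of them 15
big_df = bd.create_df(size=5_000_000)
big_df = bd.numbaloop(big_df)  # same reset-on-positive cumulative sum, compiled with numba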