I need to roll through a dataframe and update the estimate in column 2 based on the previous value in column 1 and column 2. I am looking for ideas on how to vectorise this approach/speed it up as it is currently quite costly in the loop format below. The first row is set to 0 as there is no preceding value.
df = pd.DataFrame({'column1': [1, 2, 3, 4, 5]})
for idx, row in df.iterrows():
if idx == 0:
df.loc[idx, 'column2'] = 0
else:
df.loc[idx, 'column2'] = (df.loc[idx - 1, 'column1'] + df.loc[idx - 1, 'column2']) * DECAY
# Expected output -
column1 column2
0 1 0.0
1 2 0.9
2 3 2.61
3 4 5.049
4 5 8.1441
You can use numpy
to get the dot product of a matrix that takes DECAY
into account:
df = pd.DataFrame({"column1": [1, 2, 3, 4, 5]})
DECAY = 0.9
n = len(df)
dm = np.tril(np.power(DECAY, np.subtract.outer(np.arange(n), np.arange(n))), -1)
df["column2"] = dm.dot(df["column1"])
column1 column2
0 1 0.0000
1 2 0.9000
2 3 2.6100
3 4 5.0490
4 5 8.1441
Intermediate steps for clarity:
Create a matrix m
with the subtraction of all pairs from an array with range [0, n]:
m = np.subtract.outer(np.arange(n), np.arange(n))
[[ 0 -1 -2 -3 -4]
[ 1 0 -1 -2 -3]
[ 2 1 0 -1 -2]
[ 3 2 1 0 -1]
[ 4 3 2 1 0]]
Elevate DECAY
to the power of each element in m
:
p = np.power(DECAY, m)
[[1. 1.11111111 1.2345679 1.37174211 1.5241579 ]
[0.9 1. 1.11111111 1.2345679 1.37174211]
[0.81 0.9 1. 1.11111111 1.2345679 ]
[0.729 0.81 0.9 1. 1.11111111]
[0.6561 0.729 0.81 0.9 1. ]]
Keep the lower triangle of the matrix:
dm = np.tril(p, -1)
[[0. 0. 0. 0. 0. ]
[0.9 0. 0. 0. 0. ]
[0.81 0.9 0. 0. 0. ]
[0.729 0.81 0.9 0. 0. ]
[0.6561 0.729 0.81 0.9 0. ]]
Alternative approach using less memory, taking on average 2.7 seconds for the example dataframe on my machine:
df = pd.DataFrame({"column1": np.arange(1, 60001)})
DECAY = 0.9
n = len(df)
c = np.zeros(n)
factors = np.power(DECAY, np.arange(n))
for i in range(1, n):
c[i] = np.dot(factors[1 : i + 1][::-1], df["column1"][:i])
df["column2"] = c
column1 column2
0 1 0.0000
1 2 0.9000
2 3 2.6100
3 4 5.0490
4 5 8.1441
... ... ...
59995 59996 539874.0000
59996 59997 539883.0000
59997 59998 539892.0000
59998 59999 539901.0000
59999 60000 539910.0000
[60000 rows x 2 columns]