
Vectorize a DataFrame operation using only previous rows' data for each row


Is there a way to vectorize a pandas DataFrame row operation so that each row is computed from only the preceding rows' data, without Python-level iteration?

I'm trying to calculate quantiles for a given distribution on each row, but I want to avoid data leakage.

So, for each row, I want to calculate the quantiles for the sequence[:current_row].

import pandas as pd
import numpy as np

np.random.seed(42)
test_df = pd.DataFrame({'column_1': np.random.random(10)})

qtls = pd.DataFrame({f'q{j}': [] for j in range(1, 10)})
for i in range(1, len(test_df) + 1):
    # quantiles 0.1 .. 0.9 over the rows seen so far
    row = np.quantile(test_df['column_1'].iloc[:i], np.arange(0.1, 1.0, 0.1))
    qtls = pd.concat([qtls,
                      pd.DataFrame({f'q{j}': [v] for j, v in enumerate(row, start=1)})])
qtls = qtls.reset_index(drop=True)

test_df = pd.concat([test_df, qtls], axis=1)
test_df

This is the brute-force for-loop implementation.

And this is the expected result:

index column_1 q1 q2 q3 q4 q5 q6 q7 q8 q9
0 0.374540 0.374540 0.374540 0.374540 0.374540 0.374540 0.374540 0.374540 0.374540 0.374540
1 0.950714 0.432158 0.489775 0.547392 0.605010 0.662627 0.720245 0.777862 0.835479 0.893097
2 0.731994 0.446031 0.517522 0.589012 0.660503 0.731994 0.775738 0.819482 0.863226 0.906970
3 0.598658 0.441776 0.509011 0.576247 0.625326 0.665326 0.705327 0.753866 0.819482 0.885098
4 0.156019 0.243427 0.330836 0.419364 0.509011 0.598658 0.651993 0.705327 0.775738 0.863226
5 0.155995 0.156007 0.156019 0.265279 0.374540 0.486599 0.598658 0.665326 0.731994 0.841354
6 0.058084 0.116830 0.155999 0.156014 0.243427 0.374540 0.509011 0.625326 0.705327 0.819482
7 0.866176 0.126621 0.156004 0.177871 0.330836 0.486599 0.625326 0.718660 0.812503 0.891538
8 0.601115 0.136412 0.156009 0.243427 0.419364 0.598658 0.600624 0.679642 0.785667 0.883084
9 0.708073 0.146203 0.156014 0.308984 0.509011 0.599887 0.643898 0.715249 0.758830 0.874630
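As a sanity check on the expected output, the second row can be verified by hand: with only two observations, np.quantile's default linear interpolation places the q-th quantile at a + q * (b - a). A minimal sketch using the (rounded) values from the table above:

```python
import numpy as np

a, b = 0.374540, 0.950714  # first two values of column_1

# default 'linear' interpolation over two points: a + q * (b - a)
q1 = np.quantile([a, b], 0.1)  # ~0.432158, matching column q1 in row 1
q5 = np.quantile([a, b], 0.5)  # the midpoint, ~0.662627, matching column q5

print(q1, q5)
```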

How can I do it efficiently?

Wouldn't it be convenient if there were a df['some_column'] = df['another_column'].cumquantile()?


Solution

  • I don't think you can really vectorize this, but you could simplify your code to:

    np.random.seed(0)
    test_df = pd.DataFrame({'column_1':np.random.random(10)})
    
    q = test_df['column_1'].expanding().quantile
    
    out = test_df.join(pd.DataFrame({f'q{i}': q(x) for i, x in
                                     enumerate(np.arange(0.1,1.0,0.1), start=1)}))
    

    which should nevertheless be much faster, since it no longer calls concat inside a loop.

    Output:

       column_1        q1        q2        q3        q4        q5        q6        q7        q8        q9
    0  0.548814  0.548814  0.548814  0.548814  0.548814  0.548814  0.548814  0.548814  0.548814  0.548814
    1  0.715189  0.565451  0.582089  0.598726  0.615364  0.632001  0.648639  0.665277  0.681914  0.698552
    2  0.602763  0.559603  0.570393  0.581183  0.591973  0.602763  0.625249  0.647734  0.670219  0.692704
    3  0.544883  0.546062  0.547241  0.548420  0.559603  0.575788  0.591973  0.614006  0.647734  0.681462
    4  0.423655  0.472146  0.520638  0.545669  0.547241  0.548814  0.570393  0.591973  0.625249  0.670219
    5  0.645894  0.484269  0.544883  0.546848  0.548814  0.575788  0.602763  0.624329  0.645894  0.680542
    6  0.437587  0.432014  0.459046  0.523424  0.546455  0.548814  0.581183  0.611390  0.637268  0.673612
    7  0.891773  0.433407  0.480506  0.545276  0.548027  0.575788  0.611390  0.641581  0.687471  0.768164
    8  0.963663  0.434801  0.501965  0.546455  0.559603  0.602763  0.637268  0.687471  0.785823  0.906151
    9  0.383442  0.419633  0.434801  0.512694  0.547241  0.575788  0.620016  0.666683  0.750506  0.898962
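    A quick way to confirm that the expanding-window version reproduces the brute-force loop is to compute both on the same data and compare; a minimal sketch:

    ```python
    import numpy as np
    import pandas as pd

    np.random.seed(0)
    test_df = pd.DataFrame({'column_1': np.random.random(10)})
    probs = np.arange(0.1, 1.0, 0.1)

    # brute-force loop: quantiles of the prefix ending at each row
    loop = pd.DataFrame(
        [np.quantile(test_df['column_1'].iloc[:i], probs)
         for i in range(1, len(test_df) + 1)],
        columns=[f'q{j}' for j in range(1, 10)],
    )

    # expanding-window version from the answer
    q = test_df['column_1'].expanding().quantile
    exp = pd.DataFrame({f'q{i}': q(x) for i, x in enumerate(probs, start=1)})

    print(np.allclose(loop.to_numpy(), exp.to_numpy()))  # expect: True
    ```

    Both use linear interpolation by default (np.quantile's default method and expanding().quantile's default interpolation), so the results agree to floating-point precision.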