
Vectorize a DataFrame operation using only previous rows' data for each row


Is there a way to vectorize a pandas DataFrame row operation so that each row is computed from only the preceding rows' data, without Python-level iteration?

I'm trying to calculate quantiles for a given distribution on each row, but I want to avoid data leakage.

So, for each row, I want to calculate the quantiles for the sequence[:current_row].

import pandas as pd
import numpy as np

np.random.seed(42)
test_df = pd.DataFrame({'column_1': np.random.random(10)})

qtls = pd.DataFrame({f'q{j}': [] for j in range(1, 10)})
for i in range(1, len(test_df) + 1):
    # quantiles 0.1 .. 0.9 over the rows seen so far
    row = np.quantile(test_df['column_1'].iloc[:i], np.arange(0.1, 1.0, 0.1))
    qtls = pd.concat([qtls,
                      pd.DataFrame({f'q{j}': [v] for j, v in enumerate(row, start=1)})])
qtls = qtls.reset_index(drop=True)

test_df = pd.concat([test_df, qtls], axis=1)
test_df

This is the brute-force for-loop implementation.

And this is the expected result:

index column_1 q1 q2 q3 q4 q5 q6 q7 q8 q9
0 0.374540 0.374540 0.374540 0.374540 0.374540 0.374540 0.374540 0.374540 0.374540 0.374540
1 0.950714 0.432158 0.489775 0.547392 0.605010 0.662627 0.720245 0.777862 0.835479 0.893097
2 0.731994 0.446031 0.517522 0.589012 0.660503 0.731994 0.775738 0.819482 0.863226 0.906970
3 0.598658 0.441776 0.509011 0.576247 0.625326 0.665326 0.705327 0.753866 0.819482 0.885098
4 0.156019 0.243427 0.330836 0.419364 0.509011 0.598658 0.651993 0.705327 0.775738 0.863226
5 0.155995 0.156007 0.156019 0.265279 0.374540 0.486599 0.598658 0.665326 0.731994 0.841354
6 0.058084 0.116830 0.155999 0.156014 0.243427 0.374540 0.509011 0.625326 0.705327 0.819482
7 0.866176 0.126621 0.156004 0.177871 0.330836 0.486599 0.625326 0.718660 0.812503 0.891538
8 0.601115 0.136412 0.156009 0.243427 0.419364 0.598658 0.600624 0.679642 0.785667 0.883084
9 0.708073 0.146203 0.156014 0.308984 0.509011 0.599887 0.643898 0.715249 0.758830 0.874630
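As a sanity check on the expected output, the second row can be verified by hand: with only two observations, np.quantile's default linear interpolation places the q-th quantile at a + q * (b - a). A minimal sketch using the (rounded) values from the table above:

```python
import numpy as np

a, b = 0.374540, 0.950714  # first two values of column_1

# default 'linear' interpolation over two points: a + q * (b - a)
q1 = np.quantile([a, b], 0.1)  # ~0.432158, matching column q1 in row 1
q5 = np.quantile([a, b], 0.5)  # the midpoint, ~0.662627, matching column q5

print(q1, q5)
```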

How can I do it efficiently?

Wouldn't it be convenient if there were a df['some_column'] = df['another_column'].cumquantile()?


Solution

  • I don't think you can really vectorize this, but you could simplify your code to:

    np.random.seed(0)
    test_df = pd.DataFrame({'column_1':np.random.random(10)})
    
    q = test_df['column_1'].expanding().quantile
    
    out = test_df.join(pd.DataFrame({f'q{i}': q(x) for i, x in
                                     enumerate(np.arange(0.1,1.0,0.1), start=1)}))
    

    which should nevertheless be much faster, since it no longer calls concat inside a loop.

    Output:

       column_1        q1        q2        q3        q4        q5        q6        q7        q8        q9
    0  0.548814  0.548814  0.548814  0.548814  0.548814  0.548814  0.548814  0.548814  0.548814  0.548814
    1  0.715189  0.565451  0.582089  0.598726  0.615364  0.632001  0.648639  0.665277  0.681914  0.698552
    2  0.602763  0.559603  0.570393  0.581183  0.591973  0.602763  0.625249  0.647734  0.670219  0.692704
    3  0.544883  0.546062  0.547241  0.548420  0.559603  0.575788  0.591973  0.614006  0.647734  0.681462
    4  0.423655  0.472146  0.520638  0.545669  0.547241  0.548814  0.570393  0.591973  0.625249  0.670219
    5  0.645894  0.484269  0.544883  0.546848  0.548814  0.575788  0.602763  0.624329  0.645894  0.680542
    6  0.437587  0.432014  0.459046  0.523424  0.546455  0.548814  0.581183  0.611390  0.637268  0.673612
    7  0.891773  0.433407  0.480506  0.545276  0.548027  0.575788  0.611390  0.641581  0.687471  0.768164
    8  0.963663  0.434801  0.501965  0.546455  0.559603  0.602763  0.637268  0.687471  0.785823  0.906151
    9  0.383442  0.419633  0.434801  0.512694  0.547241  0.575788  0.620016  0.666683  0.750506  0.898962
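    A quick way to confirm that the expanding-window version reproduces the brute-force loop is to compute both on the same data and compare; a minimal sketch:

    ```python
    import numpy as np
    import pandas as pd

    np.random.seed(0)
    test_df = pd.DataFrame({'column_1': np.random.random(10)})
    probs = np.arange(0.1, 1.0, 0.1)

    # brute-force loop: quantiles of the prefix ending at each row
    loop = pd.DataFrame(
        [np.quantile(test_df['column_1'].iloc[:i], probs)
         for i in range(1, len(test_df) + 1)],
        columns=[f'q{j}' for j in range(1, 10)],
    )

    # expanding-window version from the answer
    q = test_df['column_1'].expanding().quantile
    exp = pd.DataFrame({f'q{i}': q(x) for i, x in enumerate(probs, start=1)})

    print(np.allclose(loop.to_numpy(), exp.to_numpy()))  # expect: True
    ```

    Both use linear interpolation by default (np.quantile's default method and expanding().quantile's default interpolation), so the results agree to floating-point precision.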