Search code examples
pythonpandasdataframerolling-computation

Python Get k largest values of a Rolling Window for each column


I have a dataset as a Pandas DataFrame, and I am trying to get the k largest values within several rolling windows of different sizes.

Simplified problem:

import pandas as pd
import numpy as np
np.random.seed(42)

def GenerateData(N=20, num_cols=2):
    X = pd.DataFrame(np.random.rand(N, num_cols))
    return X
X = GenerateData()

## >>> X.head()
##    0         1
## 0  0.971595  0.329454
## 1  0.187766  0.138250
## 2  0.573455  0.976918
## 3  0.207987  0.672529
## 4  0.271034  0.549839

My goal is then to get the k largest values within each rolling window for each column. So if k_largest=3 and the rolling window sizes are windows=[4,7], we want the 3 largest values for windows of size 4 and 7. The way I am currently doing this is

def GetKLargestForWindow(windows=[4,7], k_largest=3, raw=False):
    laggedVals = []
    for L in windows:
        for k in range(k_largest):
            x_k_max = X.rolling(L).apply(lambda c: sorted(c, reverse=True)[k], raw=raw)
            x_k_max = x_k_max.add_prefix( f'W{L}_{k+1}_' )
            laggedVals.append( x_k_max )
    laggedVals = pd.concat(laggedVals, axis=1).sort_index(axis=1)
    return laggedVals
laggedVals = GetKLargestForWindow()

## >>> laggedVals.shape
## (20,12)

## >>> laggedVals.columns
## Index(['W4_1_0', 'W4_1_1', 'W4_2_0', 'W4_2_1', \
##  'W4_3_0', 'W4_3_1', 'W7_1_0','W7_1_1', \
##  'W7_2_0', 'W7_2_1', 'W7_3_0', 'W7_3_1'],dtype='object')

Note that there should be 12 columns total in this example. The column names there indicate W{window_size}_{j}_{col} where j=1,2,3 corresponding to the 3 largest values of each window size for each column.

However my dataset is very large and I'm looking for a more efficient way to do this as the code takes very long to run. Any suggestions?


Benchmarks:

import timeit
## >>> timeit.timeit('GetKLargestForWindow()', globals=globals(), number=1000)
## 15.590040199999976

## >>> timeit.timeit('GetKLargestForWindow(raw=True)', globals=globals(), number=1000)
## 6.497314199999892

Edit

I have mostly solved this - huge speedup (especially in larger datasets when you increase N, windows, and k_largest) by setting raw=True in the apply-max function.


Solution

  • As always, if you want speed, use numpy as much as possible. Python loops are extremely slow compared to numpy vectorized code:

    from numpy.lib.stride_tricks import sliding_window_view
    
    def GetKLargestForWindow_CodeDifferent(windows=[4,7], k_largest=3):
        n_row, n_col = X.shape
    
        data = []
        for w in windows:
            # Create a rolling view of size w for each column in the dataframe
            view = sliding_window_view(X, w, axis=0)
            # Sort each view, reverse it (so largest first), and take the first
            # k_largest elements
            view = np.sort(view)[..., ::-1][..., :k_largest]
            # Reshape the numpy array for easy conversion into a dataframe
            view = np.reshape(view, (n_row - w + 1, -1))
            # We know the first `w - 1` rows are all NaN since there are not enough
            # data for the rolling operation
            data.append(np.vstack([
                np.zeros((w - 1, view.shape[1])) + np.nan,
                view
            ]))
    
        # `data` is shaped in this order
        cols_1 = [f"W{w}_{k+1}_{col}" for w in windows for col in range(n_col) for k in range(k_largest)]
        # But we want the columns in this order for easy comparison with the original code
        cols_2 = [f"W{w}_{k+1}_{col}" for w in windows for k in range(k_largest) for col in range(n_col)]
        
        return pd.DataFrame(np.hstack(data), columns=cols_1)[cols_2]
    

    First, let's compare the result:

    X = GenerateData(100_000, 2)
    a = GetKLargestForWindow(raw=True)
    b = GetKLargestForWindow_CodeDifferent()
    
    assert a.compare(b).empty, "a and b are not the same"
    

    Next, let's benchmark them:

    %timeit GetKLargestForWindow(raw=True)
    5.31 s ± 128 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    %timeit GetKLargestForWindow_CodeDifferent()
    54.1 ms ± 761 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)