python pandas dataframe rolling-computation

Apply sklearn logloss with rolling on pandas dataframe


My function call looks something like this:

    loss = log_loss(y_true=validate_df['y'], y_pred=validate_probs, sample_weight=validate_df['weight'], normalize=True)

Is there any way to combine this with pandas rolling() functionality, so that the loss is calculated over a trailing window of, for example, 10k rows?


Solution

  • I couldn't find a very clean way to make rolling() work on a multi-column dataframe, but here is the best I could do: a custom window function that uses each window's index to look up the other columns and then applies log_loss.

    
    import pandas as pd
    import numpy as np
    from sklearn.metrics import log_loss
    
    # Everything in one dataframe, but you can have your pred in a separate one
    # if you want (it must share the same index)
    df = pd.DataFrame({
        'y': [1, 0, 1, 1, 0, 1, 0, 1],
        'y_pred': [0.7, 0.3, 0.8, 0.9, 0.4, 0.6, 0.2, 0.8],
        'weight': [1.0, 1.5, 0.5, 1.0, 2.0, 1.0, 0.8, 1.2]
    })
    
    def weighted_log_loss(window):
        # window is a series whose contents we're not interested in, we just want
        # its index so we can `loc` the matching rows from other columns
        y = df.loc[window.index, 'y']
        y_pred = df.loc[window.index, 'y_pred']
        weight = df.loc[window.index, 'weight']
        return log_loss(
            y_true=y,
            y_pred=y_pred,
            sample_weight=weight,
            normalize=True,
            # pass the labels explicitly so a window that happens to contain
            # only one class does not raise a ValueError
            labels=[0, 1]
        )
    
    window_size = 3
    # raw=False hands each window to the function as a Series (with its index);
    # the first window_size - 1 rows come out as NaN
    print(df['y'].rolling(window=window_size).apply(weighted_log_loss, raw=False))
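
For a window as large as the 10k rows in the question, calling log_loss once per window may become slow. Since the normalized, weighted log loss is a weighted mean of per-row losses, it can probably be rebuilt from two rolling sums instead. The following is only a sketch under some assumptions: the labels are binary (0/1), the column names are the ones from the example above, and the clipping constant `eps` is my own choice, which may not match what your sklearn version does for probabilities of exactly 0 or 1.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'y': [1, 0, 1, 1, 0, 1, 0, 1],
    'y_pred': [0.7, 0.3, 0.8, 0.9, 0.4, 0.6, 0.2, 0.8],
    'weight': [1.0, 1.5, 0.5, 1.0, 2.0, 1.0, 0.8, 1.2]
})

# Assumption: clip probabilities away from 0 and 1 to keep log() finite
eps = np.finfo(float).eps
p = df['y_pred'].clip(eps, 1 - eps)

# Per-row binary log loss (negative log-likelihood of the true label)
row_loss = -(df['y'] * np.log(p) + (1 - df['y']) * np.log(1 - p))

window_size = 3
# Weighted mean over each trailing window = rolling sum of (weight * loss)
# divided by rolling sum of weights
weighted_sum = (row_loss * df['weight']).rolling(window_size).sum()
weight_sum = df['weight'].rolling(window_size).sum()
rolling_loss = weighted_sum / weight_sum

print(rolling_loss)
```

As with the apply-based version, the first `window_size - 1` rows are NaN. It is worth checking a few windows against log_loss directly before relying on this.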
    
    
    

    It turns out there is a rolling_apply function (source) that works directly with multi-column dataframes, which might suit you better.
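
To illustrate what a multi-column rolling helper does, here is a rough NumPy-only sketch. It is not the function mentioned above: `rolling_apply_multi`, its signature, and the NumPy-based `weighted_log_loss` are hypothetical names made up for this example, and binary 0/1 labels with probabilities strictly between 0 and 1 are assumed.

```python
import numpy as np

def rolling_apply_multi(func, window, *arrays):
    """Hypothetical helper: call func on aligned trailing windows of several arrays."""
    arrays = [np.asarray(a) for a in arrays]
    n = len(arrays[0])
    out = np.full(n, np.nan)  # positions without a full window stay NaN
    for end in range(window, n + 1):
        # Slice the same trailing window out of every input array
        out[end - 1] = func(*(a[end - window:end] for a in arrays))
    return out

def weighted_log_loss(y, p, w):
    # Binary log loss averaged with sample weights (stand-in for sklearn's log_loss)
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return np.average(loss, weights=w)

y = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0.7, 0.3, 0.8, 0.9, 0.4, 0.6, 0.2, 0.8]
weight = [1.0, 1.5, 0.5, 1.0, 2.0, 1.0, 0.8, 1.2]

result = rolling_apply_multi(weighted_log_loss, 3, y, y_pred, weight)
print(result)
```

The window function receives the columns as separate arguments, so it no longer needs to reach back into a global dataframe by index.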