Search code examples
pythonpandasjupyter-notebookcorrelation

Rolling Correlation of Multi-Column Panda


I am trying to calcualte and then visualize the rolling correlation between multiple columns in a 180 (3 in this example) days window.

My data is formatted like that (in the orginal file there are 12 columns plus the timestamp and thousands of rows):

import numpy as np
    import pandas as pd
    df = pd.DataFrame({"Timestamp" : ['1993-11-01' ,'1993-11-02', '1993-11-03', '1993-11-04','1993-11-15'], "Austria" : [6.18 ,6.18, 6.17, 6.17, 6.40],"Belgium" : [7.05, 7.05, 7.2, 7.5, 7.6],"France" : [7.69, 7.61, 7.67, 7.91, 8.61]},index = [1, 2, 3,4,5])

Timestamp   Austria Belgium France
1   1993-11-01  6.18    7.05    7.69
2   1993-11-02  6.18    7.05    7.61
3   1993-11-03  6.17    7.20    7.67
4   1993-11-04  6.17    7.50    7.91
5   1993-11-15  6.40    7.60    8.61

I cant just use this formula, because I get a formatting error if I do because of the Timestamp column:

df.rolling(2).corr(df)
ValueError: could not convert string to float: '1993-11-01'

When I drop the Timestamp column I get a result of 1.0 for every cell, thats also not right and additionally I lose the Timestamp which I will need for the visualization graph in the end.

df_drop = df.drop(columns=['Timestamp'])
df_drop.rolling(2).corr(df_drop)


Austria Belgium France
1   NaN NaN NaN
2   NaN NaN 1.0
3   1.0 1.0 1.0
4   -inf1.0 1.0
5   1.0 1.0 1.0

Any experiences how to do the rolling correlation with multiple columns and a data index?


Solution

  • Building on the answer of Shreyans Jain I propose the following. It should work with an arbitrary number of columns:

    import itertools as it
    
    # omit timestamp-col
    cols = list(df.columns)[1:]
    # -> ['Austria', 'Belgium', 'France']
    
    col_pairs = list(it.combinations(cols, 2))
    # -> [('Austria', 'Belgium'), ('Austria', 'France'), ('Belgium', 'France')]
    
    res = pd.DataFrame()
    
    for pair in col_pairs:
        # select the first three letters of each name of the pair
        corr_name = f"{pair[0][:3]}_{pair[1][:3]}_corr"
        res[corr_name] = df[list(pair)].\
                        rolling(min_periods=1, window=3).\
                        corr().iloc[0::2, -1].reset_index(drop=True)
    
    print(str(res))
    
       Aus_Bel_corr  Aus_Fra_corr  Bel_Fra_corr
    0           NaN           NaN           NaN
    1           NaN           NaN           NaN
    2     -1.000000     -0.277350      0.277350
    3     -0.755929     -0.654654      0.989743
    4      0.693375      0.969346      0.849167
    
    

    The NaN-Values at the beginning result from the windowing.

    Update: I uploaded a notebook with detailed explanations for what happens inside the loop.

    https://github.com/cknoll/demo-material/blob/main/pandas/pandas_rolling_correlation_iloc.ipynb