Tags: python, numpy, pandas, pattern-matching, correlation

pandas: Rolling correlation with a fixed patch for pattern matching


Happy New Year.

I am looking for a way to compute the correlation of a rolling window and a fixed window ('patch') with pandas. The ultimate objective is to do pattern matching.

From what I read in the docs (and hopefully I missed something), neither corr() nor corrwith() allows you to lock one of the Series / DataFrames.
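
For example, here is a quick standalone sketch of the alignment behaviour that gets in the way; the Series a and b and their values are just illustrative. corr() aligns the two Series by index before correlating, so when the indexes barely overlap the result is computed on only the overlapping points:

import numpy as np
import pandas as pd

a = pd.Series(np.random.rand(4),
              index=pd.date_range('2000-01-01 00:00', periods=4, freq='5min'))
b = pd.Series(np.arange(4.0),
              index=pd.date_range('2000-01-01 00:10', periods=4, freq='5min'))

# Only the two overlapping timestamps survive the automatic alignment,
# so the correlation is computed on just two points and comes out as +/-1.
print(a.corr(b))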

The best (crappy) solution I could come up with is listed below. When it is run on 50K rows with a patch of 30 samples, the processing time goes into the Ctrl+C range.

I am very grateful for suggestions and alternatives. THANK YOU.

Please run the code below and it will be very clear what I am trying to do:

import numpy as np
import pandas as pd
from pandas import Series
from pandas import DataFrame

# Create test DataFrame df and a patch to be found.
n = 10
rng = pd.date_range('1/1/2000 00:00:00', periods=n, freq='5min')
df = DataFrame(np.random.rand(n, 1), columns=['a'], index=rng)

n = 4
rng = pd.date_range('1/1/2000 00:10:00', periods=n, freq='5min')
patch = DataFrame(np.arange(n), columns=['a'], index=rng)

print()
print('    *** Start corr example ***')
# To avoid the automatic alignment between df and patch, 
# I need to reset the index.
patch.reset_index(inplace=True, drop=True)
# Cannot do:
#    df.reset_index(inplace=True, drop=True)

df['corr'] = np.nan

for i in range(df.shape[0]):
    window = df.iloc[i : i+patch.shape[0]]
    # If the slice has only two rows, it is just a line between two points;
    # correlating two points with the patch starts giving
    # misleading values like 1 or -1.
    if window.shape[0] != patch.shape[0]:
        break
    else:
        # I need to reset_index for the window,
        # which is less efficient than doing it outside the
        # for loop, where the patch has its reset_index done.
        # If I did the df.reset_index up there,
        # I would still have automatic realignment, just
        # by the (integer) index.
        window = window.reset_index(drop=True)

        # On top of the obvious inefficiency
        # of this method, I cannot just corrwith()
        # between specific columns in the dataframe;
        # corrwith() runs for all.
        # Alternatively I could create a new DataFrame
        # only with the needed columns:
        #     df_col = DataFrame(df.a)
        #     patch_col = DataFrame(patch.a)
        # Alternatively I could join the patch to
        # the df and shift it.
        corr = window.corrwith(patch)

        print()
        print('===========================')
        print('window:')
        print(window)
        print('---------------------------')
        print('patch:')
        print(patch)
        print('---------------------------')
        print('Corr for this window')
        print(corr)
        print('============================')

        df.loc[df.index[i], 'corr'] = corr['a']

print()
print('    *** End corr example ***')
print(" Please inspect var 'df'")
print()

Solution

  • Clearly, the copious use of reset_index is a signal that we are fighting with pandas' indexing and automatic alignment. Oh, how much easier things would be if we could just forget about the index! Indeed, that is what NumPy is for. (Generally speaking, use Pandas when you need alignment or grouping by index; use NumPy when doing computation on N-dimensional arrays.)

    Using NumPy will make the computation much faster, because we can drop the for-loop and express everything it did as a single computation on a NumPy array of rolling windows.
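
    Concretely, the per-window quantity we are after is just the sample Pearson correlation between one window w and the patch p. Here is a standalone single-window sketch (the values of w and p are made up) of the formula that the vectorized code below applies to every window at once:

    import numpy as np

    w = np.array([0.1, 0.4, 0.3, 0.9])   # one rolling window (illustrative values)
    p = np.array([0.0, 1.0, 2.0, 3.0])   # the fixed patch
    m = len(p)

    num = ((w - w.mean()) * (p - p.mean())).sum()
    dom = (m - 1) * np.sqrt(w.var(ddof=1) * p.var(ddof=1))
    print(num / dom)                      # same value as np.corrcoef(w, p)[0, 1]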

    We can look inside pandas/core/frame.py's DataFrame.corrwith to see how the computation is done, and then translate it into corresponding code on NumPy arrays, adjusting as necessary for the fact that we want to run the computation on a whole array of rolling windows at once instead of one window at a time, while keeping patch constant. (Note: the Pandas corrwith method handles NaNs; to keep the code a bit simpler, I've assumed there are no NaNs in the inputs.)

    import numpy as np
    import pandas as pd
    from pandas import Series
    from pandas import DataFrame
    import numpy.lib.stride_tricks as stride
    np.random.seed(1)
    
    n = 10
    rng = pd.date_range('1/1/2000 00:00:00', periods=n, freq='5min')
    df = DataFrame(np.random.rand(n, 1), columns=['a'], index=rng)
    
    m = 4
    rng = pd.date_range('1/1/2000 00:10:00', periods=m, freq='5min')
    patch = DataFrame(np.arange(m), columns=['a'], index=rng)
    
    def orig(df, patch):
        patch.reset_index(inplace=True, drop=True)
    
        df['corr'] = np.nan
    
        for i in range(df.shape[0]):
            window = df.iloc[i : i+patch.shape[0]]
            if window.shape[0] != patch.shape[0]:
                break
            else:
                window = window.reset_index(drop=True)
                corr = window.corrwith(patch)

                df.loc[df.index[i], 'corr'] = corr['a']
    
        return df
    
    def using_numpy(df, patch):
        # Take the sizes from the inputs themselves rather than from
        # module-level globals, so the function works for any df/patch.
        n, m = len(df), len(patch)

        left = df['a'].values
        itemsize = left.itemsize
        # Strided view of shape (n-m+1, m): row i is df['a'].values[i:i+m].
        left = stride.as_strided(left, shape=(n - m + 1, m), strides=(itemsize, itemsize))

        right = patch['a'].values

        ldem = left - left.mean(axis=1)[:, None]
        rdem = right - right.mean()

        num = (ldem * rdem).sum(axis=1)
        dom = (m - 1) * np.sqrt(left.var(axis=1, ddof=1) * right.var(ddof=1))
        correl = num / dom

        # .ix is gone from modern pandas; assign the result positionally.
        df['corr'] = np.nan
        df.iloc[:len(correl), df.columns.get_loc('corr')] = correl
        return df
    
    expected = orig(df.copy(), patch.copy())
    result = using_numpy(df.copy(), patch.copy())
    
    print(expected)
    print(result)
    

    This confirms that the values generated by orig and using_numpy are the same:

    assert np.allclose(expected['corr'].dropna(), result['corr'].dropna())
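
    As a usage sketch at the scale mentioned in the question (50K rows, a 30-sample patch), the NumPy version can also be used to rank the best-matching windows directly. The names df_big and patch_big and the nlargest call are illustrative, not part of the original code:

    n, m = 50000, 30
    rng = pd.date_range('1/1/2000 00:00:00', periods=n, freq='5min')
    df_big = DataFrame(np.random.rand(n, 1), columns=['a'], index=rng)

    rng = pd.date_range('1/1/2000 00:10:00', periods=m, freq='5min')
    patch_big = DataFrame(np.arange(m, dtype=float), columns=['a'], index=rng)

    out = using_numpy(df_big.copy(), patch_big.copy())
    # For pattern matching, the most patch-like windows have the highest correlation.
    print(out.nlargest(5, 'corr'))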
    

    Technical note:

    To create the array full of rolling windows in a memory-friendly manner, I used a striding trick I learned here.
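
    As an aside, on NumPy 1.20 or newer the same (n - m + 1, m) array of windows can be built with numpy.lib.stride_tricks.sliding_window_view, which avoids hand-computing the strides; a minimal sketch:

    from numpy.lib.stride_tricks import sliding_window_view

    x = df['a'].values
    # Each row i of `windows` is x[i : i + len(patch)], exposed as a read-only view.
    windows = sliding_window_view(x, len(patch))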


    Here is a benchmark, using n, m = 1000, 4 (lots of rows and a tiny patch to generate lots of windows):

    In [77]: %timeit orig(df.copy(), patch.copy())
    1 loops, best of 3: 3.56 s per loop
    
    In [78]: %timeit using_numpy(df.copy(), patch.copy())
    1000 loops, best of 3: 1.35 ms per loop
    

    That is roughly a 2600x speedup.