Tags: python, pandas, correlation, cross-correlation, pearson-correlation

One-to-many time series correlation in Python with very high dimensions


My database contains 1 million unique terms that were typed into my website's search box.

It currently contains two columns: "search term" (the user request) and "volume" (the number of requests for that term in a given month). The database is partitioned into monthly tables covering the last 10 years. The mean volume is 18 per month. Some terms are missing from some monthly partitions because no user searched for them that month.

I wish to be able to analyse any single term and quickly identify its top n most strongly correlated terms using Python.

Given the size of the dataset, generating an entire correlation matrix would be wasteful in terms of memory and CPU.

What DataFrame structure and function would be best suited to this one-to-many comparison in Python? And would that function require any detrending to be carried out first?


Solution

  • You could build the full correlation matrix every month, or not the full matrix but only the rows for a list of interesting terms (a few-to-all approach); a sketch of that variant follows below. Either way you have the statistics precomputed and saved on file.
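
    As an illustration, here is a minimal sketch of the few-to-all variant, assuming the df built in the example below; the interesting shortlist and the file name are hypothetical, and to_parquet requires pyarrow or fastparquet:

    import pandas as pd

    # hypothetical shortlist of terms worth precomputing each month
    interesting = ['aaa', 'aab']
    # one column of correlations per interesting term, against all other terms
    few_to_all = pd.DataFrame(
        {t: df.drop(t, axis=0).corrwith(df.loc[t], axis=1)
         for t in interesting})
    few_to_all.to_parquet('monthly_corr.parquet')  # persist this month's stats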

    If you choose to get the one-to-all correlation on demand, you can at least build a DataFrame that works as a cache, storing the result each time you calculate the correlations of one term; see the sketch just below.
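
    A minimal sketch of such a cache, assuming the df and corrwith call from the example below (get_correlations and cache are hypothetical names, not part of the original answer):

    cache = pd.DataFrame()  # one column of correlations per term already computed

    def get_correlations(df, term):
        if term not in cache.columns:
            # keep the term itself (correlation 1.0) so all columns share one index
            cache[term] = df.corrwith(df.loc[term], axis=1)
        return cache[term].drop(term)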

    In order to compute the correlation of one term against all others you can use DataFrame.corrwith:

    Say you have the following df:

    import string

    import numpy as np
    import pandas as pd

    # 25 * 20 * 20 = 10,000 synthetic three-letter terms
    terms_list = [''.join((a, b, c))
                  for a in string.ascii_lowercase[:25]
                  for b in string.ascii_lowercase[:20]
                  for c in string.ascii_lowercase[:20]]
    np.random.seed(1)
    # random monthly volumes in [10, 25], with NaNs standing in for missing months
    df = pd.Series(
        np.random.choice(list(np.arange(10, 26)) + [np.nan], int(120e4)),
        index=pd.MultiIndex.from_product([terms_list, range(120)],
                                         names=['term', 'month'])
        )
    df = df.dropna().unstack()  # one row per term, one column per month
    pivot_term = terms_list[0]
    

    print(df)

    aaa    15.0  21.0  22.0  18.0  19.0  21.0  15.0  ...   NaN  15.0  23.0  11.0  20.0  10.0  17.0
    aab    10.0  24.0  23.0  21.0  16.0  23.0  25.0  ...   NaN  15.0  12.0  11.0  21.0  15.0  19.0
    aac    21.0  11.0  10.0  17.0  10.0  12.0  13.0  ...  10.0  10.0  25.0  14.0  20.0  22.0  15.0
    aad     NaN  10.0  21.0  22.0  21.0  13.0  22.0  ...  11.0  17.0  12.0  14.0  15.0  17.0  22.0
    aae    23.0  10.0  17.0  25.0  19.0  11.0  11.0  ...  10.0  25.0  18.0  16.0  10.0  16.0  11.0
    ...     ...   ...   ...   ...   ...   ...   ...  ...   ...   ...   ...   ...   ...   ...   ...
    ytp    24.0  18.0  16.0  23.0   NaN  19.0  18.0  ...  20.0  15.0  21.0  11.0  14.0  18.0  19.0
    ytq    22.0  11.0  17.0  24.0  12.0  20.0  17.0  ...  16.0   NaN  13.0  13.0  18.0  22.0  15.0
    ytr    22.0  19.0  20.0  11.0  10.0  20.0  14.0  ...  24.0  21.0   NaN  19.0  10.0  24.0  22.0
    yts    22.0   NaN  22.0  17.0  14.0  14.0  25.0  ...  14.0  22.0   NaN  23.0  14.0  25.0  10.0
    ytt    17.0  16.0  15.0  21.0  11.0  19.0  16.0  ...  10.0  19.0  19.0  13.0  21.0  18.0  16.0
    
    [10000 rows x 120 columns]
    

    and the code to compute the one-to-all correlations, timed:

    from time import time

    t1 = time()
    max_periods = 120  # cap the number of most recent months used
    df = df.iloc[:, -max_periods:]
    # correlate pivot_term's row with every other term's row
    corr = df.drop(pivot_term, axis=0).corrwith(df.loc[pivot_term], axis=1)
    t1 = time() - t1
    print(corr)
    print(t1)
    

    Output

    term
    aab    0.045972
    aac    0.064941
    aad   -0.057009
    aae   -0.187645
    aaf   -0.075473
             ...
    ytp    0.103756
    ytq   -0.054769
    ytr   -0.115004
    yts    0.123223
    ytt    0.230628
    Length: 9999, dtype: float64
    9.76
    

    From here you can filter interesting terms with corr.nlargest or corr.nsmallest, as in the sketch below.
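
    For instance, with n = 10 (the value of n and the use of abs to rank by strength regardless of sign are illustrative choices, not from the original answer):

    n = 10
    print(corr.nlargest(n))        # most positively correlated terms
    print(corr.nsmallest(n))       # most negatively correlated terms
    print(corr.abs().nlargest(n))  # strongest correlations of either sign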

    PS

    You might also want to look into a smaller datatype that still fits the maximum volume per month, say np.int16. Note that plain NumPy integer dtypes cannot hold NaN, so with missing months you would need pandas' nullable Int16 dtype (or to fill the gaps first); a sketch follows.
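
    A minimal sketch of that conversion, using the df from the example above; the memory figures in the comments are approximate:

    # NaNs block a plain np.int16 cast, so use pandas' nullable Int16 instead
    df_small = df.astype('Int16')
    print(df.memory_usage(deep=True).sum())        # float64: 8 bytes per cell
    print(df_small.memory_usage(deep=True).sum())  # Int16: ~3 bytes per cell (2 + validity mask)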