My database contains 1 million unique terms that were typed into my website's search box.
It currently contains two columns: "search term" (the user request) and "volume" (the number of requests for the search term made in a given month). The database is partitioned into monthly tables for the last 10 years. The mean volume is 18 per month. Some searches are missing some month partitions if they were not requested by any users.
I wish to be able to analyse any single term and quickly identify its top n most meaningful correlated terms, using Python.
Due to its size, generating an entire correlation matrix would be wasteful in terms of memory and CPU.
What dataframe structure and function would be best suited to this one-to-many comparison in python? And would this function require any detrending to be carried out?
You could build the full correlation matrix every month, or not the full matrix but only the rows for a shortlist of interesting terms (a few-to-all approach). That way you have the stats saved on file.
If you choose to compute the one-to-all correlations on demand, you can at least build a DataFrame that works as a cache, storing the result each time you calculate the correlations for a term.
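A minimal sketch of that cache idea (get_correlations is a hypothetical helper name, and df is assumed to be the wide terms-by-months frame built below):
corr_cache = {}  # term -> Series of its correlations with all other terms

def get_correlations(df, term):
    # compute on demand and memoize
    if term not in corr_cache:
        corr_cache[term] = df.drop(term, axis=0).corrwith(df.loc[term], axis=1)
    return corr_cache[term]

# to persist across sessions, e.g.:
# pd.DataFrame(corr_cache).to_pickle('corr_cache.pkl')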
To compute the correlation of one term with all the others you can use DataFrame.corrwith:
Say you have the following df:
import string
from time import time

import numpy as np
import pandas as pd

# 25 * 20 * 20 = 10,000 synthetic three-letter terms
terms_list = [''.join((a, b, c))
              for a in string.ascii_lowercase[:25]
              for b in string.ascii_lowercase[:20]
              for c in string.ascii_lowercase[:20]]

np.random.seed(1)
# random monthly volumes in [10, 25], with NaNs standing in for missing months
df = pd.Series(
    np.random.choice(list(np.arange(10, 26)) + [np.nan], int(120e4)),
    index=pd.MultiIndex.from_product([terms_list, range(120)],
                                     names=['term', 'month'])
)
# wide format: one row per term, one column per month
df = df.dropna().unstack()
pivot_term = terms_list[0]
print(df)
month 0 1 2 3 4 5 6 ... 113 114 115 116 117 118 119
term
aaa 15.0 21.0 22.0 18.0 19.0 21.0 15.0 ... NaN 15.0 23.0 11.0 20.0 10.0 17.0
aab 10.0 24.0 23.0 21.0 16.0 23.0 25.0 ... NaN 15.0 12.0 11.0 21.0 15.0 19.0
aac 21.0 11.0 10.0 17.0 10.0 12.0 13.0 ... 10.0 10.0 25.0 14.0 20.0 22.0 15.0
aad NaN 10.0 21.0 22.0 21.0 13.0 22.0 ... 11.0 17.0 12.0 14.0 15.0 17.0 22.0
aae 23.0 10.0 17.0 25.0 19.0 11.0 11.0 ... 10.0 25.0 18.0 16.0 10.0 16.0 11.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
ytp 24.0 18.0 16.0 23.0 NaN 19.0 18.0 ... 20.0 15.0 21.0 11.0 14.0 18.0 19.0
ytq 22.0 11.0 17.0 24.0 12.0 20.0 17.0 ... 16.0 NaN 13.0 13.0 18.0 22.0 15.0
ytr 22.0 19.0 20.0 11.0 10.0 20.0 14.0 ... 24.0 21.0 NaN 19.0 10.0 24.0 22.0
yts 22.0 NaN 22.0 17.0 14.0 14.0 25.0 ... 14.0 22.0 NaN 23.0 14.0 25.0 10.0
ytt 17.0 16.0 15.0 21.0 11.0 19.0 16.0 ... 10.0 19.0 19.0 13.0 21.0 18.0 16.0
[10000 rows x 120 columns]
The code:
t1 = time()
max_periods = 120
# restrict to the last max_periods months
df = df.iloc[:, -max_periods:]
# get correlations: the pivot term's row against every other row, month-wise
corr = df.drop(pivot_term, axis=0).corrwith(df.loc[pivot_term], axis=1)
t1 = time() - t1
print(corr)
print(t1)
Output
term
aab 0.045972
aac 0.064941
aad -0.057009
aae -0.187645
aaf -0.075473
...
ytp 0.103756
ytq -0.054769
ytr -0.115004
yts 0.123223
ytt 0.230628
Length: 9999, dtype: float64
9.76
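Note that corrwith computes each pairwise correlation over only the months where both series are non-NaN, so a term that overlaps the pivot term in just a handful of months can show a spuriously extreme correlation. A minimal sketch of filtering by a minimum overlap (the 24-month threshold is an arbitrary example):
# count months observed for both the pivot term and each other term
overlap = (df.drop(pivot_term, axis=0).notna() & df.loc[pivot_term].notna()).sum(axis=1)
corr = corr[overlap >= 24]  # keep terms sharing at least 24 observed months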
From here you can filter the most interesting terms with corr.nlargest or corr.nsmallest.
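For example (n=10 is arbitrary here):
top = corr.nlargest(10)      # the 10 most positively correlated terms
bottom = corr.nsmallest(10)  # the 10 most negatively correlated terms
As for the detrending part of the question: Pearson correlation (the corrwith default) is inflated when two series merely share a common trend or seasonality, so it can be worth correlating month-over-month changes rather than raw volumes. A minimal sketch using first differences:
d = df.diff(axis=1)  # month-over-month changes remove a shared linear trend
corr_detrended = d.drop(pivot_term, axis=0).corrwith(d.loc[pivot_term], axis=1)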
PS
You might also want to look into a smaller datatype that still fits the maximum volume per month, say np.int16 (or pandas' nullable Int16, since plain np.int16 cannot hold the NaNs of missing months).
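A sketch of the downcast; Int16 is pandas' nullable integer dtype, which, unlike np.int16, tolerates NaNs, though whether corrwith accepts nullable integers directly depends on your pandas version, so float32 is shown as a fallback:
# nullable integers keep the NaNs for missing months
df_small = df.astype('Int16')
# alternatively, float32 halves memory versus float64 and works directly with corrwith
df_small = df.astype(np.float32)
print(df_small.memory_usage(deep=True).sum())  # compare against the original df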