My database contains 1 million unique terms that were typed into my website's search box.
It currently contains two columns: "search term" (the user request) and "volume" (the number of requests for the search term made in a given month). The database is partitioned into monthly tables for the last 10 years. The mean volume is 18 per month. Some searches are missing some month partitions if they were not requested by any users.
I wish to be able to analyse any single term and quickly identify its top n most meaningful correlated terms, using Python.
Due to its size, generating an entire correlation matrix would be wasteful in terms of memory and CPU.
What dataframe structure and function would be best suited to this one-to-many comparison in python? And would this function require any detrending to be carried out?
You could build the full correlation matrix every month, or not the full matrix but only the rows for a shortlist of interesting terms (a few-to-all approach). That way you have the stats saved on file.
If you choose to compute the one-to-all correlations on demand, you can at least build a DataFrame that works as a cache, storing the result each time you calculate the correlations for a term.
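A minimal sketch of that cache idea (get_correlations is a hypothetical helper name, and df is assumed to be the wide terms-by-months frame built below):
corr_cache = {}  # term -> Series of its correlations with all other terms

def get_correlations(df, term):
    # compute on demand and memoize
    if term not in corr_cache:
        corr_cache[term] = df.drop(term, axis=0).corrwith(df.loc[term], axis=1)
    return corr_cache[term]

# to persist across sessions, e.g.:
# pd.DataFrame(corr_cache).to_pickle('corr_cache.pkl')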
To compute the correlation of one term with all the others you can use DataFrame.corrwith:
Say you have the following df:
import string
from time import time

import numpy as np
import pandas as pd

# 25 * 20 * 20 = 10,000 synthetic three-letter terms
terms_list = [''.join((a, b, c))
              for a in string.ascii_lowercase[:25]
              for b in string.ascii_lowercase[:20]
              for c in string.ascii_lowercase[:20]]

np.random.seed(1)
# random monthly volumes in [10, 25], with NaNs standing in for missing months
df = pd.Series(
    np.random.choice(list(np.arange(10, 26)) + [np.nan], int(120e4)),
    index=pd.MultiIndex.from_product([terms_list, range(120)],
                                     names=['term', 'month'])
)
# wide format: one row per term, one column per month
df = df.dropna().unstack()
pivot_term = terms_list[0]
print(df)
month 0 1 2 3 4 5 6 ... 113 114 115 116 117 118 119
term
aaa 15.0 21.0 22.0 18.0 19.0 21.0 15.0 ... NaN 15.0 23.0 11.0 20.0 10.0 17.0
aab 10.0 24.0 23.0 21.0 16.0 23.0 25.0 ... NaN 15.0 12.0 11.0 21.0 15.0 19.0
aac 21.0 11.0 10.0 17.0 10.0 12.0 13.0 ... 10.0 10.0 25.0 14.0 20.0 22.0 15.0
aad NaN 10.0 21.0 22.0 21.0 13.0 22.0 ... 11.0 17.0 12.0 14.0 15.0 17.0 22.0
aae 23.0 10.0 17.0 25.0 19.0 11.0 11.0 ... 10.0 25.0 18.0 16.0 10.0 16.0 11.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
ytp 24.0 18.0 16.0 23.0 NaN 19.0 18.0 ... 20.0 15.0 21.0 11.0 14.0 18.0 19.0
ytq 22.0 11.0 17.0 24.0 12.0 20.0 17.0 ... 16.0 NaN 13.0 13.0 18.0 22.0 15.0
ytr 22.0 19.0 20.0 11.0 10.0 20.0 14.0 ... 24.0 21.0 NaN 19.0 10.0 24.0 22.0
yts 22.0 NaN 22.0 17.0 14.0 14.0 25.0 ... 14.0 22.0 NaN 23.0 14.0 25.0 10.0
ytt 17.0 16.0 15.0 21.0 11.0 19.0 16.0 ... 10.0 19.0 19.0 13.0 21.0 18.0 16.0
[10000 rows x 120 columns]
The code:
t1 = time()
max_periods = 120
# restrict to the last max_periods months
df = df.iloc[:, -max_periods:]
# get correlations: the pivot term's row against every other row, month-wise
corr = df.drop(pivot_term, axis=0).corrwith(df.loc[pivot_term], axis=1)
t1 = time() - t1
print(corr)
print(t1)
Output
term
aab 0.045972
aac 0.064941
aad -0.057009
aae -0.187645
aaf -0.075473
...
ytp 0.103756
ytq -0.054769
ytr -0.115004
yts 0.123223
ytt 0.230628
Length: 9999, dtype: float64
9.76
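Note that corrwith computes each pairwise correlation over only the months where both series are non-NaN, so a term that overlaps the pivot term in just a handful of months can show a spuriously extreme correlation. A minimal sketch of filtering by a minimum overlap (the 24-month threshold is an arbitrary example):
# count months observed for both the pivot term and each other term
overlap = (df.drop(pivot_term, axis=0).notna() & df.loc[pivot_term].notna()).sum(axis=1)
corr = corr[overlap >= 24]  # keep terms sharing at least 24 observed months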
From here you can filter the most interesting terms with corr.nlargest or corr.nsmallest.
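For example (n=10 is arbitrary here):
top = corr.nlargest(10)      # the 10 most positively correlated terms
bottom = corr.nsmallest(10)  # the 10 most negatively correlated terms
As for the detrending part of the question: Pearson correlation (the corrwith default) is inflated when two series merely share a common trend or seasonality, so it can be worth correlating month-over-month changes rather than raw volumes. A minimal sketch using first differences:
d = df.diff(axis=1)  # month-over-month changes remove a shared linear trend
corr_detrended = d.drop(pivot_term, axis=0).corrwith(d.loc[pivot_term], axis=1)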
PS
You might also want to look into a smaller datatype that still fits the maximum volume per month, say np.int16 (or pandas' nullable Int16, since plain np.int16 cannot hold the NaNs of missing months).
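A sketch of the downcast; Int16 is pandas' nullable integer dtype, which, unlike np.int16, tolerates NaNs, though whether corrwith accepts nullable integers directly depends on your pandas version, so float32 is shown as a fallback:
# nullable integers keep the NaNs for missing months
df_small = df.astype('Int16')
# alternatively, float32 halves memory versus float64 and works directly with corrwith
df_small = df.astype(np.float32)
print(df_small.memory_usage(deep=True).sum())  # compare against the original df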