My main objective in this question is to calculate the rolling dot_product or cosine_similarity over a pandas dataframe. Going through the documentation, I found that we can compute the rolling correlation using the following syntax:
df.rolling(window_size).corr()
However, I am wondering how to compute the rolling cosine_similarity. For instance, I would like to have something like:
from sklearn.metrics.pairwise import cosine_similarity
df.rolling(window=3, method="table").apply(lambda table: cosine_similarity(table.T))
However, this is throwing an error.
Below is the entire code to reproduce the problem.
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
pd.__version__ # 1.4.3
def generate_random_walk(_len):
    lst = [np.random.randn()]
    for i in range(_len):
        lst.append(lst[i] + np.random.randn())
    return lst
_len = 100
_num_arrays = 5
array_2D = np.array([generate_random_walk(_len) for _ in range(_num_arrays)]).T
df = pd.DataFrame(array_2D)
df.rolling(window=3, method="table").apply(lambda table: cosine_similarity(table.T))
Please note that I am using Python version 3.10 and pandas version 1.4.3.
I would expect the answer to include a solution using the pandas native API. Otherwise, any solution avoiding a for loop would be great, given how slow the code can get with a for loop. I personally compared the corr function using pandas native functions against a for loop, and the former was more than 100 times faster. Finally, if no pandas native function is available for cosine_similarity, and if there is no solution similar to df.rolling(window=3, method="table").apply(lambda table: cosine_similarity(table.T)), I would appreciate a solution using a for loop and numba for faster computations.
Thanks in advance for the support.
Below is my solution, using numpy and numba for faster processing.
import numpy as np
import pandas as pd
from numba import jit, njit
import numba
from sklearn.metrics.pairwise import cosine_similarity
The main objective of this problem is to calculate the rolling cosine_similarity over a given matrix. In other words, I would like to achieve behavior similar to the pandas native function df.rolling(window_size).corr(), which returns the correlation coefficients for each sliding window over a given dataframe. More information about the rolling correlation can be found here. The following figure demonstrates how the sliding window works over axis=0.
As a result, each window is passed to the corr() function, and the result of corr() is a dataframe of size (N x N), where N is the number of columns in the main dataframe. The shape of the output of df.rolling(window_size).corr() is roughly (len(df) - window_size + 1, N, N).
In the following sections, I will go over the main steps for calculating the cosine similarity over a 2D array (section 1). In section 2, I will show the main implementation in Python, using numba to achieve faster processing times. Finally, I will apply the cosine similarity over a sliding window, as shown in the figure above.
The formula for cosine similarity between two vectors a and b is:
cos(theta) = (a . b) / (|a| |b|)
Therefore, if we have a matrix A with m rows and N columns, calculating the cosine similarity between every pair of columns requires a nested for loop over the column pairs, applying the cosine formula above to each pair. A Python code snippet would look like this:
def calc_cosine_sim(a, b):
    dot_product = np.dot(a, b)  # scalar dot product, not the element-wise a * b
    a_norm = np.linalg.norm(a)
    b_norm = np.linalg.norm(b)
    return dot_product / (a_norm * b_norm)
A = np.array(some_values)  # placeholder input, A.shape = (m, N)
N = A.shape[1]
lst = []
for i in range(N - 1):
    for j in range(i + 1, N):
        lst.append(calc_cosine_sim(A[:, i], A[:, j]))
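As a quick numeric check of the formula, here is a minimal sketch on small hand-picked vectors. The function name cosine_sim and the example vectors are illustrative choices, not part of the original code.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity of two 1-D vectors: (a . b) / (|a| * |b|)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])            # 45 degrees away from a
c = np.array([0.0, 2.0])            # orthogonal to a

sim_ab = cosine_sim(a, b)           # equals 1 / sqrt(2)
sim_aa = cosine_sim(a, a)           # a vector compared with itself gives 1
sim_ac = cosine_sim(a, c)           # orthogonal vectors give 0
```

Note that the result depends only on the angle between the vectors, not on their lengths, which is why c having norm 2 does not matter.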
Given that we are calculating the dot product between every pair of columns, the same result can be obtained with a matrix multiplication. In other words, if we take the transpose of A and multiply it by A, we get another matrix whose entries are the dot products of A's columns.
Assume we have the following matrix:
Where each row represents one instance, and each column represents a feature, or a vector. Likewise, we will assume:
Multiplying the transpose of A with itself will produce:
Consequently, using matrix multiplication, I have achieved the same result without using for loops. This process is much faster due to vectorization in numpy.
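The equivalence between the nested loop and the matrix product can be checked directly. This is a small sketch on random data; the seed and the array size are arbitrary assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed, illustrative data only
A = rng.standard_normal((6, 3))     # m = 6 rows, N = 3 columns

# Nested loop: dot product of every pair of columns.
N = A.shape[1]
loop_dots = np.empty((N, N))
for i in range(N):
    for j in range(N):
        loop_dots[i, j] = np.dot(A[:, i], A[:, j])

# Single matrix product: entry (i, j) is the dot product of columns i and j.
gram = A.T @ A
```

The matrix gram is symmetric, and its diagonal holds the squared norms of the columns.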
However, the cosine similarity requires dividing each dot product by the norms of both vectors. Therefore, we would like to get this matrix:
However, this is simply:
However,
Therefore, if we calculate the norms |a|, |b| and |c|, and then create the above 2 matrices (B and C), we can calculate the norm_matrix, which, combined element-wise with A^{T}.A (the code in the next section divides by it), returns the final result, the cosine_similarity matrix. In the next section, I will go through the Python code to explain how the theory above gets translated into code.
One final note: C = B^{T}. Therefore, we only need to produce B, which is then used to generate C.
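The matrices referred to in this section are not reproduced in the text. The following is a sketch of the derivation for three columns a, b and c, reconstructed from the code in the next section rather than from the original figures. The symbols \odot (element-wise product) and \oslash (element-wise division) are notation introduced here.

```latex
A = \begin{pmatrix} a & b & c \end{pmatrix}, \qquad
A^{T} A = \begin{pmatrix}
a \cdot a & a \cdot b & a \cdot c \\
b \cdot a & b \cdot b & b \cdot c \\
c \cdot a & c \cdot b & c \cdot c
\end{pmatrix}

B = \begin{pmatrix}
|a| & |b| & |c| \\
|a| & |b| & |c| \\
|a| & |b| & |c|
\end{pmatrix}, \qquad
C = B^{T} = \begin{pmatrix}
|a| & |a| & |a| \\
|b| & |b| & |b| \\
|c| & |c| & |c|
\end{pmatrix}

\text{norm\_matrix} = B \odot C = \begin{pmatrix}
|a||a| & |a||b| & |a||c| \\
|b||a| & |b||b| & |b||c| \\
|c||a| & |c||b| & |c||c|
\end{pmatrix}

\text{cosine\_similarity} = (A^{T} A) \oslash \text{norm\_matrix}
```

Entry (i, j) of the final matrix is the dot product of columns i and j divided by the product of their norms, which is exactly the cosine formula applied to that pair.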
Below is the Python implementation using numpy. This is a straightforward implementation, without numba. I have added comments that reflect the steps and matrices generated above, in section 1. I have deliberately written two methods, calc_rolling_cosine_similarity_v1 and calc_rolling_cosine_similarity_v2, so that I use both the cosine_similarity implementation from sklearn and my implementation in numpy. The aim is to compare the numbers at the end and make sure the implementation is correct.
# This is the implementation of the above logic (section 1) using the numpy library
def calc_cosine_similarity_on_2darray(arr):
    '''
    Input is a 2D array.
    Return the cosine similarity matrix over the 2D array. In other words, the result is the cosine similarity between
    every pair of columns of the input 2D array/matrix.
    '''
    # Equation 1
    arr_x_arr = arr.T @ arr
    # Calculating Matrix B
    arr_norm = np.linalg.norm(arr, axis=0)
    arr_norm_r = np.expand_dims(arr_norm, axis=0)
    arr_norm_r_m = np.tile(arr_norm_r, (arr_norm_r.shape[1], 1))
    # Calculating Matrix C
    arr_norm_c_m = arr_norm_r_m.T
    # Calculating Matrix norm_matrix
    arr_norm_mul = arr_norm_r_m * arr_norm_c_m
    # return matrix: cosine_similarity
    return arr_x_arr / arr_norm_mul
# This function calculates the rolling cosine similarity using the sliding_window_view numpy function and
# cosine_similarity from the sklearn library
def calc_rolling_cosine_similarity_v1(array, window_size, num_features):
    # Use the `array` argument rather than the global array_2D
    array_2D_windowed = np.squeeze(np.lib.stride_tricks.sliding_window_view(array,
                                   window_shape=(window_size, num_features)))
    # arr is transposed on purpose because cosine_similarity from sklearn computes cos_sim between the matrix rows.
    # Therefore, the transpose puts the features, aka columns, in the first dimension.
    cos_sim = [cosine_similarity(arr.T) for arr in array_2D_windowed]
    return cos_sim
def calc_rolling_cosine_similarity_v2(array, window_size, num_features):
    # Use the `array` argument rather than the global array_2D
    array_2D_windowed = np.squeeze(np.lib.stride_tricks.sliding_window_view(array,
                                   window_shape=(window_size, num_features)))
    cos_sim = [calc_cosine_similarity_on_2darray(arr) for arr in array_2D_windowed]
    return cos_sim
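To make the windowing step concrete, here is a small sketch of what sliding_window_view returns for a 2D input when the window spans all columns. The toy array and window size are assumptions chosen for illustration; sliding_window_view requires NumPy 1.20 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

arr = np.arange(18, dtype=float).reshape(6, 3)   # 6 rows, 3 features
window_size, num_features = 4, arr.shape[1]

# The raw result has a singleton axis because the window spans all columns.
raw = sliding_window_view(arr, window_shape=(window_size, num_features))

# Dropping only that axis gives one (window_size, num_features) block per step.
windows = np.squeeze(raw, axis=1)

first_window = windows[0]    # same values as arr[0:4]
last_window = windows[-1]    # same values as arr[2:6]
```

Passing axis=1 to np.squeeze is a defensive choice made here: a bare np.squeeze would also drop the leading axis when the input contains exactly one window.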
Below is the same implementation, modified to use the numba package to speed up the computations. The following numpy functions generated errors when used with numba:
np.linalg.norm
np.expand_dims
np.tile
Finally, I could have manually implemented the numpy function np.lib.stride_tricks.sliding_window_view. I am not sure how much faster this would be when used with numba. I left it out on purpose due to time constraints.
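For reference, a manual replacement for the windowing step could look roughly like the sketch below, which slices each window directly and uses only basic array operations. The function name rolling_cosine_similarity_loop is hypothetical, and the code is shown in plain numpy; whether it compiles and runs faster under numba has not been verified here.

```python
import numpy as np

def rolling_cosine_similarity_loop(arr, window_size):
    """Rolling column-wise cosine similarity using plain slicing.

    Hypothetical helper (not from the original answer): each window is
    taken with arr[start:start + window_size], so no stride tricks are needed.
    """
    n_rows, n_cols = arr.shape
    n_windows = n_rows - window_size + 1
    out = np.empty((n_windows, n_cols, n_cols))
    for start in range(n_windows):
        window = arr[start:start + window_size]
        gram = window.T @ window                          # pairwise dot products
        norms = np.sqrt(np.sum(window * window, axis=0))  # column norms
        out[start] = gram / np.outer(norms, norms)        # divide by |x_i| * |x_j|
    return out

rng = np.random.default_rng(1)          # illustrative data only
demo = rng.standard_normal((8, 3))
result = rolling_cosine_similarity_loop(demo, window_size=4)
```

The outer product of the norm vector with itself plays the role of the norm matrix, so no tiling or transposing is needed.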
# This function returns the norm of a vector v.
@njit
def calc_norm(v):
    return np.sqrt(np.sum(np.square(v)))
# The implementation here is equivalent to the implementation of calc_cosine_similarity_on_2darray above
# but modified to utilize the numba python package for faster processing.
@njit
def calc_cosine_similarity_on_2darray_numba(arr):
    '''
    Input is a 2D array.
    Return the similarity matrix over the 2D array. In other words, the result is the cosine similarity between
    every pair of columns of the input 2D array/matrix.
    '''
    # Equation 1
    arr_x_arr = arr.T @ arr
    # Calculating Matrix B
    arr_norm_r = np.zeros(shape=(1, arr.shape[1]))
    for i in range(arr.shape[1]):  # iterate over each col in arr
        arr_norm_r[0, i] = calc_norm(arr[:, i])  # calculate the norm of every col in arr.
    arr_norm_r_m = np.ones(shape=(arr.shape[1], arr.shape[1])) * arr_norm_r
    # Calculating Matrix C
    arr_norm_c_m = arr_norm_r_m.T
    # Calculating Matrix norm_matrix
    arr_norm_mul = arr_norm_r_m * arr_norm_c_m
    # return matrix: cosine_similarity
    return arr_x_arr / arr_norm_mul
@njit
def rolling_cosine_similarity_numba(array_windowed):
    cos_sim = [calc_cosine_similarity_on_2darray_numba(arr) for arr in array_windowed]
    return cos_sim
def calc_rolling_cosine_similarity_numba(arr, window_size, num_features):
    array_windowed = np.squeeze(np.lib.stride_tricks.sliding_window_view(arr, window_shape=(window_size, num_features)))
    cos_sim = rolling_cosine_similarity_numba(array_windowed)
    return cos_sim
# Helper function to generate random signal.
def generate_random_walk(_len):
    lst = [np.random.randn()]
    for i in range(_len - 1):
        lst.append(lst[i] + np.random.randn())
    return lst
signal_len = 24*365*5
num_features = 5
window_size = 10
array_2D = np.array([generate_random_walk(signal_len) for _ in range(num_features)]).T
array_2D.shape
--> output: (43800, 5)
%%timeit
calc_rolling_cosine_similarity_v1(array_2D, window_size, num_features)
--> output: 9.87 s ± 127 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
calc_rolling_cosine_similarity_v2(array_2D, window_size, num_features)
--> output: 2.79 s ± 100 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
calc_rolling_cosine_similarity_numba(array_2D, window_size, num_features)
--> output: 343 ms ± 4.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sol1 = np.array(calc_rolling_cosine_similarity_v1(array_2D, window_size, num_features))
sol2 = np.array(calc_rolling_cosine_similarity_v2(array_2D, window_size, num_features))
sol3 = np.array(calc_rolling_cosine_similarity_numba(array_2D, window_size, num_features))
sol1.shape, sol2.shape, sol3.shape
--> output: ((43791, 5, 5), (43791, 5, 5), (43791, 5, 5))
np.allclose(sol1, sol2), np.allclose(sol2, sol3), np.allclose(sol1, sol3)
--> output: (True, True, True)
The same solution, when using numba, is roughly 9.87 / 0.343 ~ 29 times faster than the sklearn-based implementation, and 2.79 / 0.343 ~ 8 times faster than the same numpy implementation without numba.
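As a possible alternative that avoids the Python-level loop over windows entirely, the per-window products can be batched with np.einsum. This is a sketch only: the function name rolling_cosine_similarity_einsum is hypothetical, it is a different technique from the numba version above, and it has not been benchmarked against the timings reported here.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

def rolling_cosine_similarity_einsum(arr, window_size):
    """Hypothetical vectorized variant (not from the original answer)."""
    # Shape (n_windows, n_cols, window_size): the window axis is appended last.
    windows = sliding_window_view(arr, window_size, axis=0)
    # Batched Gram matrices: gram[t, i, j] = sum over w of windows[t, i, w] * windows[t, j, w]
    gram = np.einsum("tiw,tjw->tij", windows, windows)
    # Column norms per window are the square roots of the Gram diagonals.
    norms = np.sqrt(np.einsum("tii->ti", gram))
    # Broadcast the norms into a per-window norm matrix and divide.
    return gram / (norms[:, :, None] * norms[:, None, :])

rng = np.random.default_rng(2)          # illustrative data only
demo = rng.standard_normal((8, 3))
result = rolling_cosine_similarity_einsum(demo, window_size=4)
```

The output has shape (len(arr) - window_size + 1, N, N), matching the other implementations. A column that is all zeros within a window has zero norm and would produce nan entries for that window.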