I have a dataframe as shown below:
vector_a    vector_b
[1,2,3]     [2,5,6]
[0,2,1]     [2,9,1]
[4,7,1]     [1,7,4]
I would like to apply sklearn's cosine_similarity between the columns vector_a and vector_b to get a new column called 'cosine_distance' in the same dataframe. Note that vector_a and vector_b are pandas columns whose values are lists.
This is what I have attempted:
df['vector_a'] = df['vector_a'].apply(lambda x: np.asarray(x))
df['vector_b'] = df['vector_b'].apply(lambda x: np.asarray(x))
df['cosine_distance'] = cosine_similarity(df['vector_a'].apply(lambda x: np.transpose(x)),
                                          df['vector_b'].apply(lambda x: np.transpose(x)))
And I got this error:
---> 58 df['cosine_distance'] = cosine_similarity(df['vector_a'].apply(lambda x: np.transpose(x)), df['vector_b'].apply(lambda x: np.transpose(x)))
~\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py in cosine_similarity(X, Y, dense_output)
1025 # to avoid recursive import
-> 1027 X, Y = check_pairwise_arrays(X, Y)
1029 X_normalized = normalize(X, copy=True)
~\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py in check_pairwise_arrays(X, Y, precomputed, dtype)
110 else:
111 X = check_array(X, accept_sparse='csr', dtype=dtype,
--> 112 estimator=estimator)
113 Y = check_array(Y, accept_sparse='csr', dtype=dtype,
114 estimator=estimator)
~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
494 try:
495 warnings.simplefilter('error', ComplexWarning)
--> 496 array = np.asarray(array, dtype=dtype, order=order)
497 except ComplexWarning:
498 raise ValueError("Complex data not supported\n"
~\Anaconda3\lib\site-packages\numpy\core\numeric.py in asarray(a, dtype, order)
537 """
--> 538 return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.
Thank you in advance!
from sklearn.metrics.pairwise import cosine_similarity
df['cosine_similarity'] = df.apply(
    lambda row: cosine_similarity([row['vector_a']], [row['vector_b']])[0][0],
    axis=1)
cosine_similarity expects a 2D np.array or a list of lists. It doesn't know how to interpret a pd.Series of lists, which is why the conversion fails with "setting an array element with a sequence". However, even if we converted the column to a list of lists, the next problem arises: cosine_similarity returns all-vs-all similarity, i.e. every row of the first argument compared with every row of the second. So let's limit it to a pairwise comparison by artificially creating a second dimension (note the extra square brackets in [row['vector_a']], [row['vector_b']]), and then taking the only element of the resulting 1x1 array (the zeros at the end of cosine_similarity(...)[0][0]).
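To make the difference between all-vs-all and row-wise similarity concrete, here is a minimal sketch that rebuilds the dataframe from the question. The variable names (A, B, all_vs_all, row_wise, vectorized) are only illustrative, and the NumPy version at the end is an alternative I am assuming is acceptable for larger frames, not something the question asked for. It also assumes all vectors have the same length and none of them is all zeros (the manual formula would divide by zero in that case).

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Sample data copied from the question.
df = pd.DataFrame({
    "vector_a": [[1, 2, 3], [0, 2, 1], [4, 7, 1]],
    "vector_b": [[2, 5, 6], [2, 9, 1], [1, 7, 4]],
})

# Stack each column of lists into a 2D array of shape (n_rows, n_features).
A = np.array(df["vector_a"].tolist(), dtype=float)
B = np.array(df["vector_b"].tolist(), dtype=float)

# All-vs-all: entry [i, j] compares row i of A with row j of B.
all_vs_all = cosine_similarity(A, B)

# Row-wise, as in the answer: each call compares one 1xN array with another.
row_wise = df.apply(
    lambda row: cosine_similarity([row["vector_a"]], [row["vector_b"]])[0][0],
    axis=1,
)

# Vectorized alternative (assumes no zero-length vectors):
# dot product of matching rows divided by the product of their norms.
vectorized = np.einsum("ij,ij->i", A, B) / (
    np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
)

df["cosine_similarity"] = vectorized
```

The row-wise values are the diagonal of the all-vs-all matrix, so np.diag(cosine_similarity(A, B)) would also work, but it computes n*n similarities to keep only n of them.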