python knn recommendation-engine collaborative-filtering

"Why are the cosine similarities calculated by the library and by myself different?"


I'm building a book recommendation system and want to use the KNN algorithm for collaborative filtering. I believe I understand how KNN works, and I want to take the item-based approach, which requires computing the similarity between item vectors. However, the similarity computed by the library differs from the one I compute myself, and I'm not sure why. Can you help me out?

import pandas as pd
from surprise import Dataset, Reader, KNNWithMeans

# Build the ratings DataFrame
ratings_dict = {
    "item": [1, 2, 1, 2, 1, 2, 1, 2, 1],
    "user": ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'E'],
    "rating": [1, 2, 2, 4, 2.5, 4, 4.5, 5, 3],
}
df = pd.DataFrame(ratings_dict)

# Convert to the dataset format the Surprise library expects
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['user', 'item', 'rating']], reader)

# Compute the similarity matrix (item-based)
sim_options = {'name': 'cosine', 'user_based': False}
algo = KNNWithMeans(sim_options=sim_options)
trainingSet = data.build_full_trainset()
algo.fit(trainingSet)

similarity_matrix = algo.compute_similarities()
print(similarity_matrix)

This code prints:

[[1.         0.96954671]
 [0.96954671 1.        ]]

The rating matrix (users as rows, items as columns) is:

item    1    2
user          
A     1.0  2.0
B     2.0  4.0
C     2.5  4.0
D     4.5  5.0
E     3.0  NaN

However, when I compute the cosine similarity myself:

import numpy as np

# Define the two item vectors (the missing rating is filled with 0)
vector1 = np.array([1, 2, 2.5, 4.5, 3])
vector2 = np.array([2, 4, 4, 5, 0])

# Compute the cosine similarity
cosine_sim_1 = np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))

print(cosine_sim_1)

This code prints:

0.8550598237348973

I suspect the Surprise library fills the NaN value with something other than 0. I expected it to use 0, but it seems a different value was used instead.

I tried ChatGPT, but it couldn't help me solve the issue.


Solution

  • vector1 = np.array([1, 2, 2.5, 4.5])
    vector2 = np.array([2, 4, 4, 5])
    
    # Compute the cosine similarity
    cosine_sim_1 = np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))
    print(cosine_sim_1)
    

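    Surprise does not fill the NaN with any value. Its cosine similarity only takes into account the users who rated both items. User E has no rating for item 2, so that entry is dropped from both vectors, and the library computes the cosine similarity of the remaining 4D vectors, as in the code above. That gives 0.96954671, matching the library's output.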
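
    To make the difference concrete, here is a minimal sketch that masks out the positions where either item is unrated before computing the cosine similarity, and compares it with the zero-filled version from the question. The helper name `cosine_on_common` is hypothetical and not part of Surprise; it only mimics the "common users only" behaviour with NumPy, and returning 0.0 when there are no common raters is an assumption made for this sketch.

```python
import numpy as np


def cosine_on_common(u, v):
    """Cosine similarity over positions where both vectors have a rating.

    Hypothetical helper for illustration; not a Surprise API.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # Keep only the users who rated both items (no NaN in either vector).
    mask = ~np.isnan(u) & ~np.isnan(v)
    if not mask.any():
        # Assumption for this sketch: no common raters means similarity 0.
        return 0.0
    u, v = u[mask], v[mask]
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Ratings of users A..E; user E did not rate item 2.
item1 = [1, 2, 2.5, 4.5, 3]
item2 = [2, 4, 4, 5, np.nan]

# Common raters only (A..D): reproduces the library's value.
sim_common = cosine_on_common(item1, item2)

# NaN replaced by 0: reproduces the hand-computed value from the question.
sim_zero_filled = cosine_on_common(item1, np.nan_to_num(item2))

print(sim_common)       # approximately 0.9695
print(sim_zero_filled)  # approximately 0.8551
```

    Filling with 0 treats "not rated" as a very low rating, which pulls the two items apart; restricting to common raters avoids that bias.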