I'm trying to build a simple content-based filtering model on the Yelp dataset, using the data about restaurants.
I have a DataFrame in this format:
>>> business_df.dtypes
address object
attributes object
business_id object
categories object
city object
hours object
is_open object
latitude float64
longitude float64
name object
postal_code object
review_count int64
stars float64
state object
Now I'm trying to build a content-based filtering model that answers the question "Given a restaurant, recommend similar restaurants".
I'm trying to implement the model described under "Content-Based Recommender" here: https://www.datacamp.com/community/tutorials/recommender-systems-python
Basically, they use some text fields to build a Count Vectorizer matrix and then compute the cosine similarity between rows to get the similarity between movies.
Later they say:
Introduce a popularity filter: this recommender would take the 30 most similar movies, calculate the weighted ratings (using the IMDB formula from above), sort movies based on this rating, and return the top 10 movies.
I'm trying to use the Categories, Attributes, Latitude and Longitude (for distance), Stars and Review Count (stars weighted by review count, so that a higher number of reviews gives the stars more weight) to build a similar model.
But I don't know how to incorporate the numerical columns into the model here. I'm certain I cannot pass the numerical columns into the Count Vectorizer.
Can I build 2 models -- 1 with the text fields and the other by simply calculating the cosine similarity (or Pearson correlation) between the numerical columns -- and combine those 2? If yes, how would I do that?
Or could I follow the DataCamp model and handle the text fields in a model, then use the formula to incorporate ratings? If yes, I would still be unable to account for distance based on latitude and longitude.
Let us assume that the CountVectorizer gives you a matrix C of shape (N, m), where N = number of restaurants and m = number of features (here the count of the words).
Now, since you want to add numerical features, say you have k such features. You can simply compute these features for each restaurant and concatenate them to the matrix C. Each restaurant will then have (m+k) features, and the shape of C will be (N, m+k). You can use pandas to concatenate.
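As a rough sketch of the concatenation step, assuming scikit-learn and SciPy are available: since CountVectorizer returns a sparse matrix, this example swaps in `scipy.sparse.hstack` for the join instead of pandas. The `toy_df` frame and its values are made up for illustration; only the column names come from the question.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for business_df (hypothetical values, for illustration only).
toy_df = pd.DataFrame({
    "categories": ["Pizza, Italian", "Sushi, Japanese", "Pizza, Fast Food"],
    "stars": [4.5, 4.0, 3.0],
    "review_count": [120, 45, 300],
})

# C: sparse (N, m) matrix of word counts from the text column.
C = CountVectorizer().fit_transform(toy_df["categories"])

# (N, k) block of numeric features; here k = 2.
numeric = toy_df[["stars", "review_count"]].to_numpy(dtype=float)

# Stack the two blocks side by side to get an (N, m + k) matrix.
combined = hstack([C, csr_matrix(numeric)]).tocsr()

print(C.shape, combined.shape)  # (3, 6) (3, 8)
```

Here the numeric columns are appended unscaled, so `review_count` is far larger than any word count.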
Now you can simply compute the cosine similarity on this matrix; that way you take into account the text features as well as the numeric features.
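A minimal sketch of the similarity step, assuming scikit-learn is available. The `features` array is a small hypothetical stand-in for the (N, m+k) matrix, and `most_similar` is an illustrative helper, not part of any library.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical combined feature matrix of shape (N, m + k) = (4, 3).
features = np.array([
    [1.0, 0.0, 0.9],
    [1.0, 0.0, 0.8],
    [0.0, 1.0, 0.2],
    [0.0, 1.0, 0.1],
])

# Pairwise cosine similarity between all rows, shape (N, N).
sim = cosine_similarity(features)

def most_similar(index, sim_matrix, top_n=2):
    """Return row indices of the top_n most similar items, excluding the item itself."""
    scores = sim_matrix[index].copy()
    scores[index] = -np.inf  # never recommend the query restaurant itself
    return np.argsort(scores)[::-1][:top_n].tolist()

print(most_similar(0, sim))  # [1, 2]
```

For a large N the full (N, N) matrix may not fit in memory; computing similarities for one query row at a time is one way around that.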
However, I would strongly suggest you normalize these values, as some of the numeric features might have much larger magnitudes than the word counts and would then dominate the cosine similarity, leading to poor results. Also, instead of the CountVectorizer, a TF-IDF matrix or even word embeddings might give you better results.
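One possible way to apply both suggestions, sketched under the assumption that scikit-learn and SciPy are available: `TfidfVectorizer` replaces the count matrix and `MinMaxScaler` rescales each numeric column to [0, 1]. The choice of min-max scaling is an assumption, other scalers would also work, and `toy_df` is again made-up data.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for business_df (hypothetical values, for illustration only).
toy_df = pd.DataFrame({
    "categories": ["Pizza, Italian", "Sushi, Japanese", "Pizza, Fast Food"],
    "stars": [4.5, 4.0, 3.0],
    "review_count": [120, 45, 300],
})

# TF-IDF text features; rows are L2-normalized by default.
tfidf = TfidfVectorizer().fit_transform(toy_df["categories"])

# Rescale each numeric column to [0, 1] so no single column dominates.
scaled = MinMaxScaler().fit_transform(toy_df[["stars", "review_count"]])

# (N, m + k) matrix with text and numeric features on comparable scales.
combined = hstack([tfidf, csr_matrix(scaled)]).tocsr()

print(combined.shape)  # (3, 8)
```

If the text or the numeric block should count for more, multiplying one block by a weight before stacking is a simple knob to tune.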