My pandas data frame looks something like this:
Movieid   review   movieRating   wordEmbeddingVector
1         "text"   4             [100-dimensional vector]
I am trying to run a doc2vec implementation, and I want to group by movie id, take the sum of the vectors in wordEmbeddingVector, and then calculate the cosine similarity between that summed vector and an input vector. I tried:
movie_groupby = movie_data.groupby('movie_id').agg(lambda v : cosineSimilarity(np.sum(movie_data['textvec']), inputvector))
But it seemed to run for ages, and I thought I might be doing something wrong, so I removed the similarity function and just did the group-by and sum. That does not finish either (over an hour and counting). Am I doing something wrong, or is it actually just that slow? I have 135,392 rows in my data frame, so it's not massive.
movie_groupby = movie_data.groupby('movie_id').agg(lambda v : np.sum(movie_data['textvec']))
Much Appreciated!
There is a bug in your code: inside your lambda function you sum across the entire dataframe instead of just the group, which is most likely why it never finishes. This should fix things (apply is used here rather than agg, because agg hands the function one column at a time, while apply hands it the whole group):
movie_groupby = movie_data.groupby('movie_id').apply(lambda v: np.sum(v['textvec']))
Note: I replaced hotel_data with movie_data, but that must have been just a typo.
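For the cosine-similarity part of your question, once the per-movie summed vectors exist you can compare each one against the input vector with plain numpy. This is only a sketch, assuming the column names movie_id and textvec from your snippets, that inputvector is already a 100-dimensional numpy array, and that no summed vector is all zeros; cosine_similarity here is a hypothetical helper, not a library function:

import numpy as np

# sum the 100-dimensional embedding vectors within each movie group
# (equivalent to the groupby fix above, just selecting the column first)
summed = movie_data.groupby('movie_id')['textvec'].apply(np.sum)

def cosine_similarity(a, b):
    # dot product divided by the product of the vector norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# similarity between each movie's summed vector and the input vector
similarities = summed.apply(lambda vec: cosine_similarity(vec, inputvector))
print(similarities.sort_values(ascending=False).head())

Summing over the selected column only (rather than the whole 135k-row frame once per group) should bring the runtime down from hours to seconds.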