Tags: python, pandas, machine-learning, word2vec, doc2vec

Group-by and aggregate problems with numpy arrays of word vectors


My pandas data frame looks something like this:

Movieid  review  movieRating  wordEmbeddingVector
1        "text"  4            [100-dimensional vector]

I am running a doc2vec implementation, and I want to group by movie id, take the sum of the vectors in wordEmbeddingVector, and then calculate the cosine similarity between that summed vector and an input vector. I tried:

movie_groupby = movie_data.groupby('movie_id').agg(lambda v : cosineSimilarity(np.sum(movie_data['textvec'])), inputvector)
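
where cosineSimilarity is the usual cosine formula, roughly along these lines (a sketch; the exact helper isn't shown in this post):

import numpy as np

# Assumed implementation: standard cosine similarity of two 1-D numpy vectors.
def cosineSimilarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))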

But it seemed to run for ages, and I thought I might be doing something wrong, so I tried removing the similarity function and just grouping and summing. That doesn't finish either (an hour and counting now). Am I doing something wrong, or is it actually just that slow? I have 135,392 rows in my data frame, so it's not massive.

movie_groupby = movie_data.groupby('movie_id').agg(lambda v : np.sum(movie_data['textvec']))

Much appreciated!


Solution

  • There is a bug in your code: inside your lambda you sum the entire data frame's textvec column instead of just the group's, so the full 135,392-row sum is recomputed once per group, which is why it runs so long. This should fix things (selecting the column first and using apply, since each group reduces to a vector rather than a scalar):

    movie_groupby = movie_data.groupby('movie_id')['textvec'].apply(lambda v: np.sum(v))
    

    Note: I replaced hotel_data with movie_data, but that must have been just a typo.
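
    With the grouping fixed, the cosine-similarity step from your question becomes cheap: sum each group once, then score every summed vector against the input vector. A minimal sketch (summed and similarities are hypothetical names; inputvector is the query vector from your question, and cosineSimilarity is the helper sketched above):

    # One pass over the 135,392 rows to sum the vectors within each movie_id group.
    summed = movie_data.groupby('movie_id')['textvec'].apply(lambda v: np.sum(v))

    # Score each movie's summed vector against the input vector.
    similarities = summed.apply(lambda vec: cosineSimilarity(vec, inputvector))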