Tags: python, twitter, nlp, word2vec, word-embedding

word2vec: user-level, document-level embeddings with pre-trained model


I am currently developing a Twitter content-based recommender system and have a word2vec model pre-trained on 400 million tweets.

How would I go about using those word embeddings to create a document/tweet-level embedding and then get the user embedding based on the tweets they had posted?

I was initially planning to average the vectors of the words in each tweet that have a word-vector representation, and then average those tweet vectors to get a user vector, but I wasn't sure whether this was optimal or even correct. Any help is much appreciated.


Solution

  • Averaging the vectors of all the words in a short text is one way to get a summary vector for the text. It often works OK as a quick baseline. (And if all you have is word-vectors, it may be your main option.)

    Such a representation might sometimes improve if you did a weighted average based on some other measure of relative term importance (such as TF-IDF), or used raw word-vectors (before normalization to unit length, as pre-normalization magnitudes can sometimes hint at strength-of-meaning). Both the plain and weighted averages are sketched in the first code example after this list.

    You could create user-level vectors by averaging all their texts, or by (roughly equivalently) placing all their authored words into a single pseudo-document and averaging all those words together; see the second sketch below.

    You might retain more of the variety of a user's posts, especially if their interests span many areas, by first clustering their tweets into N clusters, then modeling the user as the N centroid vectors of those clusters (see the clustering sketch below). The N might even vary per user, based on how much they tweet or how far-ranging their tweets' topics seem to be.

    With the original tweets, you could also train up per-tweet vectors using an algorithm like 'Paragraph Vector' (aka 'Doc2Vec' in a library like Python gensim), as in the last sketch below. But that can have challenging RAM requirements with 400 million distinct documents. (If you have a smaller number of users, perhaps they can be the 'documents', or they could be the predicted classes of a FastText-in-classification-mode training session.)
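
To illustrate the plain and IDF-weighted averaging, here is a minimal sketch. It assumes a gensim KeyedVectors model loaded from your pre-trained file (the filename is a placeholder), pre-tokenized tweets, and an optional idf dict you would compute from your own corpus:

    import numpy as np
    from gensim.models import KeyedVectors

    # Placeholder path/format for the pre-trained 400M-tweet model; adjust to your file.
    kv = KeyedVectors.load_word2vec_format("tweets_400m.bin", binary=True)

    def tweet_vector(tokens, kv, idf=None):
        """Average the vectors of in-vocabulary tokens; optionally weight by IDF."""
        vecs, weights = [], []
        for tok in tokens:
            if tok in kv:                      # skip out-of-vocabulary tokens
                vecs.append(kv[tok])
                weights.append(idf.get(tok, 1.0) if idf else 1.0)
        if not vecs:                           # no known words -> zero vector
            return np.zeros(kv.vector_size, dtype=np.float32)
        return np.average(np.vstack(vecs), axis=0, weights=weights)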
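A user-level vector could then be the mean of those tweet vectors, or (roughly equivalently) a single average over all of a user's words pooled into one pseudo-document. A sketch, reusing tweet_vector from the example above:

    def user_vector(tweet_token_lists, kv, idf=None):
        """Average the per-tweet vectors for one user."""
        tweet_vecs = [tweet_vector(toks, kv, idf) for toks in tweet_token_lists]
        return np.mean(np.vstack(tweet_vecs), axis=0)

    def user_vector_pooled(tweet_token_lists, kv, idf=None):
        """Roughly equivalent: pool all the user's words and average once."""
        all_tokens = [tok for toks in tweet_token_lists for tok in toks]
        return tweet_vector(all_tokens, kv, idf)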
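For the clustering variant, one possible sketch uses scikit-learn's KMeans (an assumption, not something from the question) to model a user as N centroid vectors, again reusing tweet_vector:

    from sklearn.cluster import KMeans

    def user_centroids(tweet_token_lists, kv, n_clusters=5):
        """Represent a user as the centroids of clusters over their tweet vectors."""
        X = np.vstack([tweet_vector(toks, kv) for toks in tweet_token_lists])
        n = min(n_clusters, len(X))            # can't have more clusters than tweets
        km = KMeans(n_clusters=n, n_init=10, random_state=0).fit(X)
        return km.cluster_centers_             # shape: (n, vector_size)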
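Finally, a rough sketch of training document vectors with gensim's Doc2Vec, tagging documents by user so the number of learned doc-vectors matches the number of users rather than 400 million tweets. Here user_tweets is a hypothetical iterable of (user_id, token_list) pairs, and the hyperparameters are illustrative only:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # user_tweets is assumed: an iterable of (user_id, token_list) pairs.
    docs = [TaggedDocument(words=tokens, tags=[user_id])
            for user_id, tokens in user_tweets]

    model = Doc2Vec(documents=docs, vector_size=100, window=5,
                    min_count=5, workers=4, epochs=10)

    user_vec = model.dv["some_user_id"]        # learned per-tag (per-user) vector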