python · algorithm · machine-learning · nlp · word2vec

Improve speed of Python algorithm


I have used the Sentiment140 Twitter dataset for sentiment analysis.

Code:

Getting words from tweets:

tweet_tokens = []
[tweet_tokens.append(dev.get_tweet_tokens(idx)) for idx, item in enumerate(dev)]

Getting unknown words from the tokens:

words_without_embs = []
[[words_without_embs.append(w) for w in tweet if w not in word2vec] for tweet in tweet_tokens]
len(words_without_embs)

The last part of the code calculates each word's vector as the mean of its left and right neighboring words (context):

import numpy as np

vectors = {}  # word -> synthesized vector
for word in words_without_embs:
  mean_vectors = []
  for tweet in tweet_tokens:
    if word in tweet:
      idx = tweet.index(word)
      try:
        mean_vector = np.mean([word2vec.get_vector(tweet[idx-1]), word2vec.get_vector(tweet[idx+1])], axis=0)
        mean_vectors.append(mean_vector)
      except:
        pass

    if tweet == tweet_tokens[-1]: # last iteration
      mean_vector_all_tweets = np.mean(mean_vectors, axis=0)
      vectors[word] = mean_vector_all_tweets

There are 1,058,532 words, and the last part of this code runs very slowly, at about 250 words per minute.

How can I improve the speed of this algorithm?


Solution

  • More-common (& probably better) strategies for dealing with unknown words include:

    • training or using a model, such as FastText, that can offer guessed vectors for out-of-vocabulary (OOV) words (a rough sketch appears at the end of this answer)
    • acquiring more training data, so vectors for more unknown words can be learned from real usages
    • ignoring unknown words entirely

    It seems you've decided instead to synthesize new vectors for OOV words by averaging all of their immediate neighbors. I don't think this would work especially well. In many kinds of downstream uses of the word-vectors, it just tends to overweight the word's in-context neighbors – which can also be achieved far more simply and cheaply by just ignoring the unknown word entirely.

    But given what you want to do, the best approach would be to collect the neighboring words during the same pass that identifies the words_without_embs.

    For example, make words_without_embs a dict (or perhaps a collections.defaultdict), where each key is a word that will need a vector, and each value is a list of all the neighboring words you've found so far.

    Then, a single loop over tweet_tokens would fill words_without_embs with keys for every word needing a vector, while collecting all of their neighboring words into the corresponding values.

    Then, one last loop over the words_without_embs keys would simply grab the existing lists of neighbor-words for the averaging. (No more multiple passes over tweet_tokens.) A minimal sketch of this idea appears at the end of this answer.

    But again: all this work might not outperform the baseline practice of simply dropping unknown words.
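
Below is a minimal sketch of the single-pass approach described above. It assumes tweet_tokens (a list of token lists) and word2vec (a gensim KeyedVectors-style object supporting in and get_vector) exist as in the question, and it averages all collected neighbor vectors directly rather than the per-occurrence means used in the original code:

from collections import defaultdict
import numpy as np

neighbors = defaultdict(list)  # OOV word -> list of in-vocabulary neighboring words

# One pass over all tweets: record every OOV word and its immediate in-vocabulary neighbors.
for tweet in tweet_tokens:
    for idx, word in enumerate(tweet):
        if word in word2vec:
            continue
        for nbr_idx in (idx - 1, idx + 1):
            if 0 <= nbr_idx < len(tweet) and tweet[nbr_idx] in word2vec:
                neighbors[word].append(tweet[nbr_idx])

# One pass over the collected neighbors: average their vectors.
vectors = {}
for word, nbr_words in neighbors.items():
    if nbr_words:
        vectors[word] = np.mean([word2vec.get_vector(w) for w in nbr_words], axis=0)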
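
And for the first alternative listed above: gensim's FastText can return guessed vectors for words it never saw, because it composes vectors from character n-grams. This is only an illustrative sketch assuming gensim 4.x; the training parameters here are placeholders, not tuned recommendations:

from gensim.models import FastText

# Train a FastText model on the tokenized tweets (placeholder parameters).
ft_model = FastText(sentences=tweet_tokens, vector_size=100, window=5, min_count=2, epochs=5)

# FastText builds vectors from character n-grams, so even an out-of-vocabulary
# word gets a (guessed) vector instead of raising a KeyError.
vec = ft_model.wv['some_unseen_oov_word']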