I am trying to apply word embeddings to tweets. I want to create a vector for each tweet by averaging the vectors of the words in the tweet, as follows:
def word_vector(tokens, size):
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in tokens:
        try:
            vec += model_w2v[word].reshape((1, size))
            count += 1.
        except KeyError: # handling the case where the token is not in vocabulary
            continue
    if count != 0:
        vec /= count
    return vec
Next, when I try to prepare the word2vec feature set as follows:
wordvec_arrays = np.zeros((len(tokenized_tweet), 200))
#the length of the vector is 200
for i in range(len(tokenized_tweet)):
    wordvec_arrays[i,:] = word_vector(tokenized_tweet[i], 200)
wordvec_df = pd.DataFrame(wordvec_arrays)
wordvec_df.shape
I get the following error inside the loop:
ValueError                                Traceback (most recent call last)
<ipython-input-32-72aee891e885> in <module>
      4 # wordvec_arrays.reshape(1,200)
      5 for i in range(len(tokenized_tweet)):
----> 6     wordvec_arrays[i,:] = word_vector(tokenized_tweet[i], 200)
      7
      8 wordvec_df = pd.DataFrame(wordvec_arrays)

<ipython-input-31-9e6501810162> in word_vector(tokens, size)
      4     for word in tokens:
      5         try:
----> 6             vec += model_w2v.wv.__getitem__(word).reshape((1, size))
      7             count += 1.
      8         except KeyError: # handling the case where the token is not in vocabulary

ValueError: cannot reshape array of size 3800 into shape (1,200)
I checked all the available posts on Stack Overflow, but none of them really helped me.
I tried reshaping the array and it still gives me the same error.
My model is:
tokenized_tweet = df['tweet'].apply(lambda x: x.split()) # tokenizing
model_w2v = gensim.models.Word2Vec(
    tokenized_tweet,
    size=200, # desired no. of features/independent variables
    window=5, # context window size
    min_count=2,
    sg=1, # 1 for skip-gram model
    hs=0,
    negative=10, # for negative sampling
    workers=2, # no. of cores
    seed=34)
model_w2v.train(tokenized_tweet, total_examples= len(df['tweet']), epochs=20)
Any suggestions, please?
It looks like the intent of your word_vector() method is to take a list of words, and then with respect to a given Word2Vec model, return the average of all those words' vectors (when present).
To do that, you shouldn't need any explicit reshaping of vectors, or even a size parameter, because the vector size is fixed by what the model already provides. You could use utility methods from numpy to simplify the code a lot. For example, the gensim n_similarity() method, as part of its comparison of two lists of words, already does an averaging much like what you're trying, and you can look at its source as a model.
So, while I haven't tested this code, I think your word_vector() method could be essentially replaced with:
import numpy as np
def average_words_vectors(tokens, wv_model):
    vectors = [wv_model[word] for word in tokens
               if word in wv_model] # avoiding KeyError
    return np.array(vectors).mean(axis=0)
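One edge case worth guarding: if none of a tweet's tokens are in the vocabulary, the list of vectors is empty and the mean comes out as NaN. The sketch below is a variant of average_words_vectors() that returns zeros in that case and then stacks one row per tweet. The name average_word_vectors_safe, the size argument, and the toy dictionary standing in for trained word vectors are assumptions made for illustration, not part of gensim.

```python
import numpy as np

def average_word_vectors_safe(tokens, wv_model, size):
    """Average the vectors of in-vocabulary tokens; zeros if none are known."""
    vectors = [wv_model[word] for word in tokens if word in wv_model]
    if not vectors:
        # No known words: return zeros instead of the NaN an empty mean gives.
        return np.zeros(size, dtype=np.float32)
    return np.mean(vectors, axis=0)

# Toy stand-in for trained word vectors (made-up values, 3 dimensions).
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0], dtype=np.float32),
    "day": np.array([3.0, 2.0, 0.0], dtype=np.float32),
}
toy_tweets = [["good", "day"], ["unknown", "words"], ["good"]]

# One averaged row per tweet, stacked into a 2-D feature matrix.
features = np.vstack(
    [average_word_vectors_safe(tweet, toy_vectors, size=3) for tweet in toy_tweets]
)
print(features.shape)  # (3, 3)
```

With a real model you would presumably pass the model's word-vector object in place of toy_vectors and its vector size (200 in the question) as size.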
(It sometimes makes sense to work with vectors that have been normalized to unit length, as the linked gensim code does by applying gensim.matutils.unitvec() to the average. I haven't done this here, as your method hadn't taken that step, but it is something to consider.)
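If you do want unit-length vectors, the idea can be sketched with numpy alone. The helper unit_normalize below is a hypothetical stand-in that mirrors what unit normalization means for a dense vector (divide by the L2 norm, leave an all-zero vector alone); it is not gensim's own implementation.

```python
import numpy as np

def unit_normalize(vec):
    """Scale vec to unit L2 length; return it unchanged if its norm is zero."""
    norm = np.linalg.norm(vec)
    if norm > 0:
        return vec / norm
    return vec

mean_vec = np.array([3.0, 4.0])  # toy averaged vector, L2 norm is 5
unit_vec = unit_normalize(mean_vec)
print(unit_vec)  # [0.6 0.8]
```

After normalization, the dot product of two such vectors is their cosine similarity, which is one reason this step is often applied before comparing averages.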
Separate observations about your Word2Vec training code:
Typically, words with just 1, 2, or a few occurrences don't get good vectors (due to the limited number and variety of examples), but they do interfere with the improvement of other, more common words' vectors. That's why the default is min_count=5. So just be aware: your surviving vectors may get better if you use the default (or an even larger) value here, discarding more of the rarer words.
The dimensions of a "dense embedding" like word2vec vectors aren't really "independent variables" (or standalone, individually interpretable "features") as implied by your code comment, even though they may seem that way as separate values/slots in the data. For example, you can't pick one dimension out and conclude, "that's the foo-ness of this sample" (like 'coldness' or 'hardness' or 'positiveness'). Rather, any of those human-describable meanings tend to be other directions in the combined space, not perfectly aligned with any of the individual dimensions. You can sort of tease those out by comparing vectors, and downstream ML algorithms can make use of those complicated, entangled multi-dimensional interactions. But if you think of each dimension as its own "feature", in any way other than that it's technically a single number associated with the item, you may be prone to misinterpreting the vector space.
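On the min_count point above, a quick way to see how much vocabulary a given threshold would keep is to count word frequencies before training. This is a rough sketch using only the standard library; vocab_size_by_min_count and the toy documents are hypothetical, and it only assumes that words with a total count below min_count are dropped.

```python
from collections import Counter

def vocab_size_by_min_count(tokenized_docs, thresholds):
    """Return {threshold: number of distinct words with count >= threshold}."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    return {t: sum(1 for c in counts.values() if c >= t) for t in thresholds}

# Toy corpus: a appears 4 times, b twice, c and d once each.
toy_docs = [["a", "b", "a"], ["a", "c", "b"], ["a", "d"]]

sizes = vocab_size_by_min_count(toy_docs, thresholds=[1, 2, 5])
print(sizes)  # {1: 4, 2: 2, 5: 0}
```

Running the same count over tokenized_tweet with a few candidate thresholds would show how many words survive at min_count=2 versus the default of 5.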