I am experimenting with Word2Vec to find words that have the same meaning. So far it is going well: the list of positive words is accurate. However, I would also like to know where each positive word was found, as in which document.
I tried iterating over each document and comparing each word with the list of positive words, something like this:
for i in documents: # iterating the documents
    for j in i: # iterating the words in the document
        for k in similar_words: # iterating the positive words
            if k[0] in j: # k[0] is the positive word, k[1] is the similarity value
                print('found word')
This works fine. However, the positive words come back stemmed down, which is why I am using "in". So suppose the stemmed-down positive word is 'ice': many words contain the substring 'ice', and a document may contain several of them while only one is the real positive word.
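For example, the substring check also matches unrelated words that merely contain 'ice':

>>> 'ice' in 'service'  # True, even though 'service' has nothing to do with 'ice'
True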
Is there a way to avoid stemming words when using Word2Vec? Or is there a way to find the document number of each positive word found?
UPDATE
Here is my code for training the model and using 'most_similar()'
import re
import gensim
import pandas as pd
from gensim.utils import simple_preprocess

def remove_stopwords(texts):
    # Removes stopwords in a text (stop_words is a stopword list defined elsewhere)
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def sent_to_words(sentences):
    # Tokenize each sentence into a list of words and remove unwanted characters
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))
df = pd.read_excel('my_file.xlsx')
df.columns = map(str.lower, df.columns)
data = df['Comment Section'].values.tolist()
# Remove the new line character and single quotes
data = [re.sub(r'\s+', ' ', str(sent)) for sent in data]
data = [re.sub("\'", "", str(sent)) for sent in data]
# Convert our data to a list of words. Now, data_words is a 2D array,
# each index contains a list of words
data_words = list(sent_to_words(data))
# Remove the stop words
data_words_nostops = remove_stopwords(data_words)
model = gensim.models.Word2Vec(
    data_words_nostops,
    alpha=0.1,
    min_alpha=0.001,
    size=250,
    window=1,
    min_count=2,
    workers=10)
model.train(data_words_nostops, total_examples=len(data_words_nostops), epochs=10)
print(model.wv.vocab) # At this step, the words are not stemmed
positive = ['injuries', 'fail', 'dangerous', 'oil']
negative = ['train', 'westward', 'goods', 'calgary', 'car', 'automobile', 'appliance']
similar_words_size = len(model.wv.most_similar(positive=positive, negative=negative, topn=0))  # topn=0 returns the similarity scores for the whole vocabulary (gensim 3.x), so this is the vocabulary size
risks = []
for i in model.wv.most_similar(positive=positive, negative=negative, topn=similar_words_size):
    if len(i[0]) > 2:  # keep only words longer than two characters
        risks.append(i)
print(risks) # At this step, the words are stemmed
A lot of published Word2Vec work, including the original papers from Google, doesn't bother with word-stemming. If you have a large enough corpus, with many varied examples of each form of a word, then each form will get a pretty-good vector (& be closely-positioned with other forms) even as raw unstemmed words. (On the other hand, in smaller corpuses, stemming is more likely to help, by allowing all the different forms of a word to contribute their occurrences to a single good vector.)

During training, Word2Vec just watches the training texts go by for the nearby-words information it needs: it doesn't remember the contents of individual documents. If you need that info, you need to retain it outside Word2Vec, in your own code.
You could iterate over all documents to find occurrences of individual words, as in your code. (And, as @alexey's answer notes, you should compare stemmed-words to stemmed-words, rather than just checking for substring-containment.)
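For example, a minimal sketch of that full-iteration approach, assuming 'documents' is your list of tokenized documents and 'similar_words' is the (word, score) list returned by most_similar(); the NLTK PorterStemmer here is just an illustrative choice, use whatever stemmer produced your stems:

from collections import defaultdict
from nltk.stem import PorterStemmer  # illustrative stemmer choice

stemmer = PorterStemmer()
word_docs = defaultdict(set)  # positive word -> set of document indexes where it appears

for doc_index, doc in enumerate(documents):
    stemmed_doc = {stemmer.stem(word) for word in doc}  # stem every word in the document
    for word, score in similar_words:
        if stemmer.stem(word) in stemmed_doc:  # exact stem-to-stem comparison, no substring matching
            word_docs[word].add(doc_index)

print(word_docs)  # e.g. {'injuries': {0, 17, 42}, ...}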
The other option, used in full-text search, is to build a "reverse index" that remembers in which documents (and potentially where in each document) each word appears. Then, you essentially have a dictionary in which you look up "iced", and get back a list of documents like "doc1, doc17, doc42". (Or potentially, a list of docs-plus-positions, like "doc1:pos11,pos91; doc17:pos22; doc42:pos77".) That requires more work up-front, and storing the reverse-index (which, depending on the level of detail retained, can be nearly as large as the original texts), but then finds docs-containing-words much faster than a full-iteration search for each word.
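As a rough sketch of that reverse-index idea, under the same assumption that 'documents' is a list of tokenized documents (all names here are illustrative):

from collections import defaultdict

# Build the reverse index once: word -> list of (document index, position) pairs
reverse_index = defaultdict(list)
for doc_index, doc in enumerate(documents):
    for position, word in enumerate(doc):
        reverse_index[word].append((doc_index, position))

# Each later lookup is then a single dictionary access instead of a scan over every document
print(reverse_index.get('iced', []))  # e.g. [(1, 11), (17, 22), (42, 77)]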