I have a text and I want to filter the model's vocabulary with respect to that text. Is this OK?
import pandas as pd
import gensim
import nltk
from nltk import word_tokenize
from nltk.collocations import *
from nltk.stem.wordnet import WordNetLemmatizer
import re
text = "though quite simple room solid choice allocated room already used summer holiday apartment bel endroit nice place place winter"
from gensim.models import Word2Vec, KeyedVectors
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
model_filter = [w for w in list(model.wv.vocab) if w in text]
If it is OK, how can I filter the results of the most-similar function (modelo_filtrado.most_similar_cosmul????) so that only the words that belong to the text (model_filter) are kept? Thx.
Your text is a plain string. The words inside the model are individual word strings. So your existing check tests whether each individual word appears anywhere as a substring inside your text.
For example, even though 'ice' does not appear in your text as a word, this will evaluate as True:
'ice' in "though quite simple room solid choice allocated room already used summer holiday apartment bel endroit nice place place winter"
You probably want to turn your text into a list of words first:
text_words = text.split()
Otherwise, yes, your code will fill model_filter with only those words that are both in the model and in your text (or text_words).
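To make the difference concrete, here is a minimal sketch that compares the substring check with a whole-word check, and then applies the same whole-word check to a list of similarity results. Note the assumptions: model_vocab is a small hypothetical stand-in for the model's vocabulary (so the snippet runs without the GoogleNews file), and similar is a hand-written list of (word, score) pairs shaped like what a most-similar query returns, not real model output.

```python
text = ("though quite simple room solid choice allocated room already used "
        "summer holiday apartment bel endroit nice place place winter")

# Hypothetical stand-in for the model's vocabulary; with the real model this
# list would come from the loaded KeyedVectors object instead.
model_vocab = ["ice", "room", "winter", "place", "hotel", "summer", "art"]

# A set of the text's words gives fast whole-word membership tests.
text_words = set(text.split())

# Substring test (the original check): also matches 'ice' inside 'choice'
# and 'art' inside 'apartment'.
substring_filter = [w for w in model_vocab if w in text]

# Whole-word test: keeps only vocabulary words that are words of the text.
word_filter = [w for w in model_vocab if w in text_words]

# Hypothetical (word, score) pairs, shaped like a most-similar result list.
similar = [("hotel", 0.81), ("room", 0.74), ("ice", 0.66), ("winter", 0.52)]

# Keep only the similar words that actually occur in the text.
similar_in_text = [(w, s) for w, s in similar if w in text_words]

print(substring_filter)
print(word_filter)
print(similar_in_text)
```

With the real model, the same list comprehension can be applied to whatever list of (word, score) pairs the similarity call returns. Since the filtering happens after the query, you may need to request more results than you ultimately want, because many of them will not be in the text.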