python model nltk word2vec

How to filter a model with respect to text and then use most_similar?


I have a text and I want to filter the model's vocabulary down to the words that appear in that text. Is the following OK?

import pandas as pd
import gensim
import nltk
from nltk import word_tokenize
from nltk.collocations import *
from nltk.stem.wordnet import WordNetLemmatizer
import re

text = "though quite simple room solid choice allocated room already used summer holiday apartment bel endroit nice place place winter"
from gensim.models import Word2Vec,  KeyedVectors

model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
model_filter = [w for w in list(model.wv.vocab) if w in text]

If it is OK, how do I filter the results of the most-similar function (modelo_filtrado.most_similar_cosmul????) so that only the words that belong to the text (those in model_filter) remain? Thanks.


Solution

  • Your text is a plain string. The words inside the model are individual word strings. So, your existing check is looking to see if individual words appear anywhere as substrings inside your text.

    For example, even though 'ice' does not appear in your text as a word, this will evaluate as True:

    'ice' in "though quite simple room solid choice allocated room already used summer holiday apartment bel endroit nice place place winter"
    

    You probably want to turn your text into a list of words, first:

    text_words = text.split()
    

    Otherwise, yes, your code will fill model_filter with only those words that are both in the model and in your text (or text_words).
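
To make the difference concrete, here is a minimal sketch that runs without loading the model. The names toy_vocab, toy_results and keep_text_words are hypothetical and used only for illustration: toy_vocab stands in for the model's vocabulary, and toy_results imitates the list of (word, score) pairs that gensim's most_similar-style methods return, with made-up scores. It also shows one possible way to keep only the similar words that occur in the text, which is an assumption about what the question is after.

```python
text = ("though quite simple room solid choice allocated room already used "
        "summer holiday apartment bel endroit nice place place winter")

# Hypothetical stand-in for the model's vocabulary. With the real model this
# would be the vocabulary keys (model.key_to_index in gensim 4.x,
# model.vocab in gensim 3.x).
toy_vocab = ["ice", "room", "place", "winter", "hotel"]

# Substring check against the raw string: 'ice' slips through because it
# occurs inside 'choice' and 'nice'.
substring_filter = [w for w in toy_vocab if w in text]

# Whole-word check against a set of tokens: only real words of the text match.
text_words = set(text.split())
model_filter = [w for w in toy_vocab if w in text_words]


def keep_text_words(results, allowed):
    """Keep only the (word, score) pairs whose word is in `allowed`."""
    return [(word, score) for word, score in results if word in allowed]


# Made-up (word, score) pairs, only to show the shape of the data.
toy_results = [("hotel", 0.91), ("room", 0.84), ("ice", 0.40), ("winter", 0.33)]
filtered_results = keep_text_words(toy_results, set(model_filter))

print(substring_filter)
print(model_filter)
print(filtered_results)
```

With the real model, the same helper could be applied to the output of a call such as model.most_similar_cosmul(positive=[...], topn=...). Since the nearest neighbours are drawn from the whole vocabulary, few or none of them may be words from the text, so a larger topn may be needed before filtering.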