Search code examples
pythongensimword2vec

word not in vocabulary after training gensim word2vec model, why?


So I want to use word-embeddings in order to get some handy dandy cosine similarity values. After creating the model and checking for similarity of the word "not" (which is in the data I give the model) it tells me that the word is not in the vocabulary.

Why can't it find the similarity for the word 'not'?

the description data looks as follows:
[['not', 'only', 'do', 'angles', 'make', 'joints', 'stronger', 'they', 'also', 'provide', 'more', 'consistent', 'straight', 'corners', 'simpson', 'strongtie', 'offers', 'a', 'wide', 'variety', 'of', 'angles', 'in', 'various', 'sizes', 'and', 'thicknesses', 'to', 'handle', 'lightduty', 'jobs', 'or', 'projects', 'where', 'a', 'structural', 'connection', 'is', 'needed', 'some', 'can', 'be', 'bent', 'skewed', 'to', 'match', 'the', 'project', 'for', 'outdoor', 'projects', 'or', 'those', 'where', 'moisture', 'is', 'present', 'use', 'our', 'zmax', 'zinccoated', 'connectors', 'which', 'provide', 'extra', 'resistance', 'against', 'corrosion', 'look', 'for', 'a', 'z', 'at', 'the', 'end', 'of', 'the', 'model', 'numberversatile', 'connector', 'for', 'various', 'connections', 'and', 'home', 'repair', 'projectsstronger', 'than', 'angled', 'nailing', 'or', 'screw', 'fastening', 'alonehelp', 'ensure', 'joints', 'are', 'consistently', 'straight', 'and', 'strongdimensions', 'in', 'x', 'in', 'x', 'inmade', 'from', 'gauge', 'steelgalvanized', 'for', 'extra', 'corrosion', 'resistanceinstall', 'with', 'd', 'common', 'nails', 'or', 'x', 'in', 'strongdrive', 'sd', 'screws']]

Note that I've already tried to give the data as separate sentences instead of separate words.

def word_vec_sim_sum(row):
    description = row.product_description.split()
    description_embedding = gensim.models.Word2Vec([description], size=150,
        window=10,
        min_count=2,
        workers=10,
        iter=10)       
    print(description_embedding.wv.most_similar(positive="not"))

Solution

  • You need to lower min_count.

    From the documentation: min_count (int, optional) – Ignores all words with total frequency lower than this. In the data you have provided "not" appears once, so it is ignored. By setting min_count to 1 it works.

    import gensim as gensim
    
    data = [['not', 'only', 'do', 'angles', 'make', 'joints', 'stronger', 'they', 'also', 'provide', 'more', 'consistent',
             'straight', 'corners', 'simpson', 'strongtie', 'offers', 'a', 'wide', 'variety', 'of', 'angles', 'in',
             'various', 'sizes', 'and', 'thicknesses', 'to', 'handle', 'lightduty', 'jobs', 'or', 'projects', 'where', 'a',
             'structural', 'connection', 'is', 'needed', 'some', 'can', 'be', 'bent', 'skewed', 'to', 'match', 'the',
             'project', 'for', 'outdoor', 'projects', 'or', 'those', 'where', 'moisture', 'is', 'present', 'use', 'our',
             'zmax', 'zinccoated', 'connectors', 'which', 'provide', 'extra', 'resistance', 'against', 'corrosion', 'look',
             'for', 'a', 'z', 'at', 'the', 'end', 'of', 'the', 'model', 'numberversatile', 'connector', 'for', 'various',
             'connections', 'and', 'home', 'repair', 'projectsstronger', 'than', 'angled', 'nailing', 'or', 'screw',
             'fastening', 'alonehelp', 'ensure', 'joints', 'are', 'consistently', 'straight', 'and', 'strongdimensions',
             'in', 'x', 'in', 'x', 'inmade', 'from', 'gauge', 'steelgalvanized', 'for', 'extra', 'corrosion',
             'resistanceinstall', 'with', 'd', 'common', 'nails', 'or', 'x', 'in', 'strongdrive', 'sd', 'screws']]
    
    
    def word_vec_sim_sum(row):
        description = row
        description_embedding = gensim.models.Word2Vec([description], size=150,
                                                       window=10,
                                                       min_count=1,
                                                       workers=10,
                                                       iter=10)
        print(description_embedding.wv.most_similar(positive="not"))
    
    
    word_vec_sim_sum(data[0])
    

    And the output:

    [('do', 0.21456070244312286), ('our', 0.1713767945766449), ('can', 0.1561305820941925), ('repair', 0.14236785471439362), ('screw', 0.1322808712720871), ('offers', 0.13223429024219513), ('project', 0.11764446645975113), ('against', 0.08542445302009583), ('various', 0.08226475119590759), ('use', 0.08193354308605194)]