
gensim word2vec: Find number of words in vocabulary


After training a word2vec model with Python's gensim, how do you find the number of words in the model's vocabulary?


Solution

  • In recent versions, the model.wv property holds the words and their vectors, and can itself report a length: the number of words it contains. So if w2v_model is your Word2Vec (or Doc2Vec or FastText) model, it's enough to do:

    vocab_len = len(w2v_model.wv)
    

    If your model is just a raw set of word-vectors, like a KeyedVectors instance rather than a full Word2Vec/etc model, it's just:

    vocab_len = len(kv_model)
    

    Other useful internals in Gensim 4.0+ include model.wv.index_to_key, a plain list of the key (word) in each index position, and model.wv.key_to_index, a plain dict mapping keys (words) to their index positions.
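To make the relationship between those two structures concrete, here is a minimal sketch that uses plain Python stand-ins rather than a trained model. The three words are made up for illustration; with a real Gensim 4.0+ model you would read both structures from model.wv instead of building them yourself.

```python
# Stand-in for model.wv.index_to_key: position -> word (hypothetical words).
index_to_key = ["the", "cat", "sat"]

# Stand-in for model.wv.key_to_index: word -> position, the inverse mapping.
key_to_index = {word: i for i, word in enumerate(index_to_key)}

# Both structures have one entry per vocabulary word.
vocab_len = len(index_to_key)

# Round trip: going from position to word and back yields the same position.
round_trip_ok = all(
    key_to_index[index_to_key[i]] == i for i in range(vocab_len)
)
```

Because the two structures mirror each other, the length of either one gives the vocabulary size.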

    In pre-4.0 versions, the vocabulary was in the vocab field of the Word2Vec model's wv property, as a dictionary, with the keys being each token (word). So there it was just the usual Python for getting a dictionary's length:

    len(w2v_model.wv.vocab)
    

    In very old gensim versions (before 0.13), vocab appeared directly on the model, so back then you would use w2v_model.vocab instead of w2v_model.wv.vocab.

    But if you're still using anything from before Gensim 4.0, you should definitely upgrade! There are big memory & performance improvements, and the changes required in calling code are relatively small – some renamings & moves, covered in the 4.0 Migration Notes.
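If the same code has to run against several gensim versions during a migration, the version differences described above could be wrapped in one small function. The following is only a sketch: vocab_size is a hypothetical helper, not part of gensim, and the demo objects are SimpleNamespace stand-ins shaped like each API generation rather than real trained models, so it has not been checked against actual gensim releases here.

```python
from types import SimpleNamespace


def vocab_size(model):
    """Return the vocabulary size across gensim API generations.

    Hypothetical helper for illustration; not part of gensim.
    """
    # Full models (Word2Vec, Doc2Vec, FastText) keep their vectors on .wv;
    # a bare KeyedVectors-like object has no .wv, so use the object itself.
    kv = getattr(model, "wv", model)

    # Gensim 4.0+ layout: key_to_index has one entry per word.
    if hasattr(kv, "key_to_index"):
        return len(kv.key_to_index)

    # Pre-4.0 layout: a vocab dict on .wv, or directly on the model
    # in very old releases.
    if hasattr(kv, "vocab"):
        return len(kv.vocab)

    raise TypeError("object does not look like a gensim model or KeyedVectors")


# Stand-ins mimicking each layout (assumed shapes, not real models).
new_style = SimpleNamespace(wv=SimpleNamespace(key_to_index={"a": 0, "b": 1}))
bare_vectors = SimpleNamespace(key_to_index={"a": 0})
old_style = SimpleNamespace(wv=SimpleNamespace(vocab={"a": 1, "b": 2, "c": 3}))
very_old_style = SimpleNamespace(vocab={"a": 1})

sizes = [vocab_size(m) for m in (new_style, bare_vectors, old_style, very_old_style)]
```

On a real Gensim 4.0+ model, len(model.wv) reports the same count as the key_to_index branch; the dictionary is used here only so the stand-ins stay simple.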