Tags: machine-learning, nlp, gensim, word-embedding, fasttext

Gensim FastText: get vocab or word index


I'm trying to use gensim's FastText, testing the sample code from the gensim docs with one small change: passing the training data via the corpus_iterable argument.

https://radimrehurek.com/gensim/models/fasttext.html

gensim_version == 4.0.1

from gensim.models import FastText
from gensim.test.utils import common_texts  # some example sentences

print(common_texts[0])
# ['human', 'interface', 'computer']
print(len(common_texts))
# 9
model = FastText(vector_size=4, window=3, min_count=1)  # instantiate
model.build_vocab(corpus_iterable=common_texts)
model.train(corpus_iterable=common_texts, total_examples=len(common_texts), epochs=10)

It works, but is there any way to get the vocabulary of the model? For example, TensorFlow's Tokenizer has a word_index that returns all the words. Is there something similar here?
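For comparison, this is roughly what that TensorFlow behaviour looks like; a minimal sketch with made-up example sentences, assuming tensorflow and its Keras Tokenizer are available:

from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["human interface computer", "survey user computer system"]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

# word_index maps each word to an integer index (most frequent first, starting at 1)
print(tokenizer.word_index)
# e.g. {'computer': 1, 'human': 2, 'interface': 3, 'survey': 4, 'user': 5, 'system': 6}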


Solution

  • The model stores its word vectors in the .wv attribute (a KeyedVectors object). In Gensim 4, which you're using, the vocabulary is available as model.wv.key_to_index: a dict mapping each word to its index.

    from gensim.models import FastText
    from gensim.test.utils import common_texts  # some example sentences
    print(common_texts[0])
    # ['human', 'interface', 'computer']
    print(len(common_texts))
    # 9
    model = FastText(vector_size=4, window=3, min_count=1)  # instantiate
    model.build_vocab(corpus_iterable=common_texts)
    model.train(corpus_iterable=common_texts, total_examples=len(common_texts), epochs=10)
    # get vocab keys with indices
    vocab = model.wv.key_to_index
    print(vocab)
    # output
    # {'system': 0, 'graph': 1, 'trees': 2, 'user': 3, 'minors': 4, 'eps': 5, 'time': 6, 
    # 'response': 7, 'survey': 8, 'computer': 9, 'interface': 10, 'human': 11}
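
  • If you also need the reverse mapping, the vocabulary size, or the vectors themselves, the same KeyedVectors object covers those too. A small follow-up sketch on the model trained above (the printed values assume the run shown here):

    # index_to_key is the reverse mapping: a list where position i holds the word with index i
    print(model.wv.index_to_key)
    # ['system', 'graph', 'trees', 'user', 'minors', 'eps', 'time', 'response',
    #  'survey', 'computer', 'interface', 'human']

    # vocabulary size and membership check
    print(len(model.wv))                     # 12
    print('human' in model.wv.key_to_index)  # True

    # because this is FastText, even out-of-vocabulary words get a vector
    # built from character n-grams
    print(model.wv['computation'].shape)     # (4,)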