
Converting string tokens into integers


I am trying to convert the tokens of a sentence into integers, but the model is giving me floats instead.

from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec

sometext = "hello how are you doing?"

tokens = word_tokenize(sometext)
model = Word2Vec([tokens], min_count=1, size=1)

When I do,

print(model["hello"])

it gives me,

[-0.3843384]

I want this to be a positive integer.


Solution

  • There's no essential reason to use Word2Vec for this. The point of Word2Vec is to map words to multi-dimensional, "dense" vectors, with many floating-point coordinates.

    Though Word2Vec happens to scan your training corpus for all unique words, and gives each unique word an integer position in its internal data structures, you wouldn't usually train a model with only one dimension (size=1), or ask the model for a word's integer slot (an internal implementation detail).
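    For contrast, here's a minimal sketch of more typical Word2Vec usage, written in the same gensim 3.x style as the question (in gensim 4.0+, size was renamed vector_size); the two-sentence corpus is just a toy stand-in for real training data:

    from nltk.tokenize import word_tokenize
    from gensim.models import Word2Vec
    
    # Toy corpus; real training needs far more sentences.
    corpus = [
        word_tokenize("hello how are you doing?"),
        word_tokenize("hello there, how is it going?"),
    ]
    
    # size controls the vector dimensionality (renamed vector_size in gensim 4.0+)
    model = Word2Vec(corpus, size=50, min_count=1)
    
    print(model.wv["hello"])  # a 50-dimensional float vector, not an int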

    If you just need a (string word)->(int id) mapping, gensim's Dictionary class can do that. See:

    https://radimrehurek.com/gensim/corpora/dictionary.html

    from nltk.tokenize import word_tokenize
    from gensim.corpora.dictionary import Dictionary
    
    sometext = "hello how are you doing?"
    
    tokens = word_tokenize(sometext)
    my_vocab = Dictionary([tokens])  # assigns every unique token an integer id
    
    print(my_vocab.token2id['hello'])  # a non-negative int
    
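    As a usage note, the same Dictionary can also map a whole token list to ids in one call; a small sketch, reusing the my_vocab object built above (doc2idx returns -1 for tokens the dictionary has never seen):

    print(my_vocab.doc2idx(tokens))            # ids for every token in the sentence
    print(my_vocab.doc2idx(['hello', 'xyz']))  # unseen tokens map to -1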

    Now, if there's actually some valid reason to be using Word2Vec – such as needing the multidimensional vectors for a larger vocabulary, trained on a significant amount of varying text – and your real need is to know its internal integer slots for words, you can access those via the internal wv property's vocab dictionary:

    print(model.wv.vocab['hello'].index)
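
    Note that wv.vocab was removed in gensim 4.0; if you're on a newer version, the equivalent lookup (per the gensim migration notes) is:

    print(model.wv.key_to_index['hello'])  # gensim >= 4.0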