machine-learning deep-learning nlp pytorch torchtext

How are the vocab and integer (one-hot) representations stored, and what does the ('string', int) tuple mean in torchtext.vocab()?


I am trying to train an RNN for binary classification. My vocabulary is built from 1,000,000 words, and the relevant outputs are shown below.

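import torchtext
from nltk.tokenize import word_tokenize  # assumed to be NLTK's tokenizer; the original import is not shown

text_field = torchtext.data.Field(tokenize=word_tokenize)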

print(text_field.vocab.freqs.most_common(15))
>>
[('.', 516822), (',', 490533), ('the', 464796), ('to', 298670), ("''", 264416), ('of', 226307), ('I', 224927), ('and', 215722), ('a', 211773), ('is', 180965), ('you', 180359), ('``', 165889), ('that', 156425), ('in', 138038), (':', 132294)]
print(text_field.vocab.itos[:15])
>>
['<unk>', '<pad>', '.', ',', 'the', 'to', "''", 'of', 'I', 'and', 'a', 'is', 'you', '``', 'that']
text_field.vocab.stoi
>>
{'<unk>': 0,'<pad>': 1,'.': 2,',': 3,'the': 4,'to': 5,"''": 6,'of': 7,'I': 8,'and': 9,'a': 10, 'is': 11,'you': 12,'``': 13,'that': 14,'in': 15,....................

The documentation says:

freqs – A collections.Counter object holding the frequencies of tokens in the data used to build the Vocab.
stoi – A collections.defaultdict instance mapping token strings to numerical identifiers.
itos – A list of token strings indexed by their numerical identifiers.

I don't understand this.

Can someone please explain what each of these is and give the intuition behind it?

For example, if "the" is represented by 4, does that mean that if a sentence contains the word "the":

  1. It is going to be a 1 at position 4? OR
  2. It is going to be a 1 at position 464796? OR
  3. It is going to be a 4 at position 464796?

What happens when "the" appears multiple times?


Solution

  • If "the" is represented by 4, then that means that

    • itos[4] is "the"
    • stoi["the"] is 4
    • there is a tuple ('the', <count>) somewhere in freqs, where count is the number of times that 'the' appears in your input text (464796 in your output). That count has nothing to do with its numerical identifier 4, so it is never used as a position or an index.
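
To make the relationship between the three attributes concrete, here is a minimal sketch in plain Python that mimics them on a toy corpus. It does not use torchtext itself: the corpus, the tie-breaking rule, and the helper names numericalize and one_hot are assumptions made for illustration, and a plain dict stands in for the defaultdict the documentation mentions. In this toy vocabulary "the" ends up at index 2 rather than 4, simply because the corpus is different.

```python
from collections import Counter

# Toy corpus standing in for the real training text (hypothetical data).
tokens = "the cat sat on the mat . the end .".split()

# freqs: how many times each token occurs in the data.
freqs = Counter(tokens)

# itos: special tokens first, then tokens by descending frequency.
# Ties are broken alphabetically here only to keep the order deterministic.
specials = ["<unk>", "<pad>"]
ordered = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
itos = specials + [tok for tok, _ in ordered]

# stoi: the inverse of itos (token string -> integer index).
stoi = {tok: idx for idx, tok in enumerate(itos)}
UNK_INDEX = stoi["<unk>"]


def numericalize(sentence_tokens):
    """Map each token to its index; unseen tokens map to <unk>."""
    return [stoi.get(tok, UNK_INDEX) for tok in sentence_tokens]


def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


ids = numericalize(["the", "dog", "sat", "on", "the", "mat"])

print(itos)                               # vocabulary in index order
print(freqs["the"], stoi["the"])          # count and index are unrelated
print(ids)                                # one integer per token
print(one_hot(stoi["the"], len(itos)))    # 1 at the token's index
```

Under these assumptions, a sentence becomes a list with one integer per token, so a word that occurs several times simply contributes the same index several times, once at each place it occurs. If a one-hot vector is wanted, it has one entry per vocabulary item and a 1 at the token's index, which corresponds to option 1 in the question.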