I am trying to train an RNN for binary classification. My vocab is built from 1000000 words; please find the relevant outputs below.
text_field = torchtext.data.Field(tokenize=word_tokenize)
print(text_field.vocab.freqs.most_common(15))
>>
[('.', 516822), (',', 490533), ('the', 464796), ('to', 298670), ("''", 264416), ('of', 226307), ('I', 224927), ('and', 215722), ('a', 211773), ('is', 180965), ('you', 180359), ('``', 165889), ('that', 156425), ('in', 138038), (':', 132294)]
print(text_field.vocab.itos[:15])
>>
['<unk>', '<pad>', '.', ',', 'the', 'to', "''", 'of', 'I', 'and', 'a', 'is', 'you', '``', 'that']
text_field.vocab.stoi
>>
{'<unk>': 0,'<pad>': 1,'.': 2,',': 3,'the': 4,'to': 5,"''": 6,'of': 7,'I': 8,'and': 9,'a': 10, 'is': 11,'you': 12,'``': 13,'that': 14,'in': 15,....................
The documentation says:
freqs – A collections.Counter object holding the frequencies of tokens in the data used to build the Vocab.
stoi – A collections.defaultdict instance mapping token strings to numerical identifiers.
itos – A list of token strings indexed by their numerical identifiers.
I do not find this comprehensible.
Can someone please explain what these are, giving the intuition behind each of them?
For example, if "the" is represented by 4, what does that mean for a sentence that contains the word "the"? What happens when "the" occurs multiple times?
If "the" is represented by 4, then that means that:
- itos[4] is "the"
- stoi["the"] is 4
- ('the', <count>) appears somewhere in freqs, where count is the number of times that 'the' appears in your input text. That count has nothing to do with its numerical identifier 4.
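To make the relationship concrete, here is a minimal sketch that rebuilds the three structures by hand with plain Python collections. It does not call torchtext; the toy corpus, the tie-breaking rule, and the variable names are assumptions made for illustration, so the indices differ from the ones in the question. It also shows that every occurrence of "the" in a sentence maps to the same identifier, regardless of how often the word occurs.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for the real training text (hypothetical data).
tokens = "the cat saw the dog and the bird".split()

# freqs: how many times each token occurs in the corpus.
freqs = Counter(tokens)

# itos: special tokens first, then tokens by descending frequency.
# Ties are broken alphabetically here (an assumption, for determinism).
specials = ["<unk>", "<pad>"]
ordered = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
itos = specials + [tok for tok, _ in ordered]

# stoi: the inverse of itos; unseen tokens fall back to index 0 (<unk>).
stoi = defaultdict(lambda: 0, {tok: i for i, tok in enumerate(itos)})

# Numericalize a sentence: each token is replaced by its identifier.
sentence = "the fish saw the cat".split()
ids = [stoi[tok] for tok in sentence]

print(freqs["the"])  # count of "the" in the corpus
print(stoi["the"])   # identifier of "the"
print(ids)           # both occurrences of "the" get the same identifier
```

In this sketch "the" occurs three times in the corpus but its identifier is 2, simply because it is the most frequent token after the two special tokens. The word "fish" never appeared in the corpus, so it maps to 0, the index of <unk>.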