
Why aren't the English letters in the wav2vec2 tokenizer vocabulary in alphabetical order?


I looked at the tokenizer vocabulary of facebook/wav2vec2-base-960h

from: https://huggingface.co/facebook/wav2vec2-base-960h/blob/main/vocab.json

and I see that the letters are not in alphabetical order, for example:

"E": 5, 
"T": 6,
"A": 7,
"O": 8, 

Why didn't they order it as:

"A": 5, 
"B": 6,
"C": 7,
"D": 8, 
...

Solution

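  • Because the order is based on how often each letter occurs in the training data used to train the model: the most frequent letters get the lowest IDs, which is why "E", "T", "A", "O" come first. The IDs are only labels for the model's output classes, so a non-alphabetical order does not affect how the tokenizer or the model behaves.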
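
To illustrate the idea, here is a minimal sketch of how a frequency-ordered character vocabulary could be built from transcripts. It is not the code that produced vocab.json: the function name build_vocab, the toy transcript, and the tie-breaking rule are assumptions made for this example, and the special tokens are assumed to occupy IDs 0 to 4 as in that file.

```python
from collections import Counter


def build_vocab(transcripts, special_tokens=("<pad>", "<s>", "</s>", "<unk>", "|")):
    """Hypothetical helper: assign character IDs in order of descending frequency."""
    # Count every non-space character across all transcripts.
    counts = Counter(
        ch for text in transcripts for ch in text.upper() if ch != " "
    )
    # Most frequent first; ties are broken alphabetically (an assumption).
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))

    # Special tokens take the first IDs, letters follow.
    vocab = {tok: i for i, tok in enumerate(special_tokens)}
    for ch in ordered:
        vocab[ch] = len(vocab)
    return vocab


# Toy transcript, invented for this example.
vocab = build_vocab(["THE CAT ATE THE TEA"])
print(vocab)
```

In the toy transcript "T" occurs most often, so it gets the first ID after the special tokens. Over a large English corpus the same procedure would be expected to yield an order starting with "E", "T", "A", "O", as seen in vocab.json.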