deep-learning nlp named-entity-recognition

Clarification on the use of Vocab file in NER


I am learning Named Entity Recognition, and I see that the training script uses a variable called vocab, which looks like this:

vocab = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\'-/\t \n\r\x0b\x0c:"

My guess is that the model is supposed to learn all the characters that appear in the text, such as a, b, c, d, and so on. What I don't understand is the purpose of characters like \n and \t. What are they for, and what is this variable used for in general?

Thanks in advance.


Solution

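  • This string is the vocabulary. In NLP, the vocabulary is the set of all tokens (words or characters) the model can represent, usually built from the training set. In your example the tokens are characters, and the escape sequences are whitespace characters: \t is a tab, \n a newline, \r a carriage return, \x0b a vertical tab, and \x0c a form feed. They are presumably included because whitespace occurs in the raw text, so the model needs an entry for it just like for any other character.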

    For NER and other NLP tasks, we usually use the vocabulary to map each token (word or character) to an embedding, and these embeddings are fed to the machine learning model (nowadays, neural network architectures such as LSTMs are used to get the best results). Character-based embeddings have an advantage over word-based embeddings for OOV (out-of-vocabulary) words, i.e. words that do not appear in the training set but are encountered during inference: an unseen word can still be represented through its characters.
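
To make the idea concrete, here is a minimal sketch of how a character vocabulary like yours could be turned into integer ids and then into embedding vectors. It is not taken from your training script: the names `PAD_ID`, `UNK_ID`, `char_to_id`, `encode`, and `EMBED_DIM` are hypothetical, reserving ids 0 and 1 for padding and unknown characters is an assumption, and the random table only stands in for embeddings that a real model would learn.

```python
import random

vocab = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\'-/\t \n\r\x0b\x0c:"

# Assumption: id 0 is reserved for padding and id 1 for characters
# that are not in the vocabulary. Real scripts may choose differently.
PAD_ID, UNK_ID = 0, 1
char_to_id = {ch: i + 2 for i, ch in enumerate(vocab)}


def encode(text):
    """Map each character to its integer id; unseen characters get UNK_ID."""
    return [char_to_id.get(ch, UNK_ID) for ch in text]


# Stand-in embedding table: one small random vector per id.
# In a real model these vectors are trainable parameters.
random.seed(0)
EMBED_DIM = 4
embeddings = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(len(vocab) + 2)
]

ids = encode("a\tb")                   # the tab gets its own id, like any letter
vectors = [embeddings[i] for i in ids]  # one vector per character
```

Here `encode("a\tb")` yields one id per character, including the tab, which is why whitespace characters need entries in the vocabulary. A character such as "é" that is missing from `vocab` falls back to `UNK_ID` instead of raising an error.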