I have been playing around with ULMFiT a lot lately and still cannot wrap my head around how the language model's ability to make sound predictions about the next word helps with the classification of texts. I guess my real problem is that I do not understand what is happening at the low level of the network. So correct me if I am wrong, but the procedure is like this, right?
The language model gets pre-trained and then fine-tuned. This part seems clear to me: based on the current and preceding words, you form probabilities for the next word. Then the model is stripped of the softmax layer that was designed to create the probability distribution. You add the decoder, consisting of a ReLU layer (what is this layer actually doing?) and another softmax layer that outputs the probability of class membership for a given text document. So here are a lot of things I do not understand: how is the text document taken in and processed? Word by word, I assume? And how do you end up with the prediction at the end? Is it averaged over all words? Hmm, you can see I am very confused. I hope you can help me understand ULMFiT better! Thanks in advance!
ULMFiT's model is an AWD-LSTM, which the paper describes as "a regular LSTM" (no attention, short-cut connections, or other sophisticated additions); an LSTM is in turn a special kind of Recurrent Neural Network (RNN).
RNNs "eat" the input text word by word (sometimes character by character), and after every "bite" they:
In text classification, the output is discarded until the very end. The updated hidden state is instead carried forward and combined with the next word to bite. After the RNN has eaten the last word, you look at the output layer (typically a softmax layer with as many neurons as you have labels), compute the loss against the true label, and then update the weights accordingly.
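To make this concrete, here is a minimal PyTorch sketch (not the actual ULMFiT/fastai code; the class name, layer sizes, and single-document loop are made up for illustration) of an LSTM classifier that eats a document token by token, carries its hidden state forward, and only uses the state after the last word for the prediction:

```python
import torch
import torch.nn as nn

class TinyLSTMClassifier(nn.Module):
    """Toy LSTM text classifier: only the state after the last word is used."""
    def __init__(self, vocab_size=10_000, emb_dim=100, hidden_dim=128, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm_cell = nn.LSTMCell(emb_dim, hidden_dim)  # one "bite" per call
        self.head = nn.Linear(hidden_dim, n_classes)        # classifier output layer

    def forward(self, token_ids):
        # token_ids: 1-D tensor of word indices for a single document
        h = torch.zeros(1, self.lstm_cell.hidden_size)      # hidden state
        c = torch.zeros(1, self.lstm_cell.hidden_size)      # cell state
        for tok in token_ids:                                # eat the text word by word
            x = self.embed(tok).unsqueeze(0)                 # look up the word's embedding
            h, c = self.lstm_cell(x, (h, c))                 # update the hidden state
        return self.head(h)                                  # logits from the LAST state only
```

During training you would feed these logits and the true label to a loss such as `nn.functional.cross_entropy` (which applies the softmax internally) and backpropagate.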
After training, suppose you want to classify a new document. The RNN "eats" it in the same way, updating its hidden state after each word. You ignore the output layer until the last word has been read: at that point, the largest element of the softmax output layer is your predicted label.
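Continuing the sketch above, prediction for a new document is then just one more forward pass, followed by taking the largest softmax probability (again, the word indices here are made up):

```python
model = TinyLSTMClassifier()
doc = torch.tensor([12, 7, 431, 9])               # word indices of the new document
with torch.no_grad():
    probs = torch.softmax(model(doc), dim=1)      # class probabilities after the last word
    predicted_label = probs.argmax(dim=1).item()  # max element = predicted label
```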
I found this PyTorch tutorial particularly helpful.