I'm a little confused after reading "Bag of Tricks for Efficient Text Classification".
What is the difference between the args wordNgrams, minn and maxn?
For example, take a text classification task with GloVe embeddings as pretrainedVectors:

```python
ft.train_supervised(file_path, lr=0.1, epoch=5, wordNgrams=2, dim=300,
                    loss='softmax', minn=2, maxn=3,
                    pretrainedVectors='glove.300d.txt', verbose=0)
```

and an input sentence 'I love you'.
Given minn=2, maxn=3, the whole sentence is transformed into [<I, I>], [<l, <lo, lo, lov, .....], etc.
For the word "love", its fastText embedding = (emb(love) (as a complete word) + emb(<l) + emb(<lo) + ....) / n.
For the sentence, it is split into [I love, love you] (because wordNgrams=2), and these 2-gram embeddings are [(fastText emb(I) + fastText emb(love))/2, (fastText emb(love) + fastText emb(you))/2].
The sentence embedding is the average of the 2-gram embeddings and has dimensionality 300. It is then fed through a layer with #labels neurons (i.e. multiplied by a matrix of size [300, #labels]).
Is this right? Please correct me if I'm wrong.
raju,
You are almost right, but the averaging happens at the very end.
First, how is a sentence tokenized?
The whole sentence is tokenized on spaces. So "I love you" will produce 4 words: "I", "love", "you" and a special word EOS (end of sentence). So far we have 4 tokens. Then, for each word, depending on what you set for minn and maxn, fastText will compute the subwords and consider them as tokens as well. So in your case with minn=2, maxn=3, these are: "<I", "<I>", "I>", "<l", "<lo", "lo", "lov", "ov", "ove", "ve", "ve>", "e>", "<y", "<yo", "yo", "you", "ou", "ou>", "u>" (we add beginning-of-word and end-of-word characters, < and >, as well).
So the overall tokens will be: "I", "love", "you", EOS, "<I", "<I>", "I>", "<l", "<lo", "lo", "lov", "ov", "ove", "ve", "ve>", "e>", "<y", "<yo", "yo", "you", "ou", "ou>", "u>".
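If it helps, here is a minimal Python sketch of that subword extraction (an illustration only; the real fastText implementation additionally hashes these subwords into a fixed number of buckets):

```python
def char_ngrams(word, minn=2, maxn=3):
    """All character n-grams of the word padded with '<' and '>' (illustration only)."""
    padded = "<" + word + ">"
    return [padded[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(padded) - n + 1)]

print(char_ngrams("I"))     # ['<I', 'I>', '<I>']
print(char_ngrams("love"))  # ['<l', 'lo', 'ov', 've', 'e>', '<lo', 'lov', 'ove', 've>']
```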
Now with wordNgrams=2, we also add tokens corresponding to pairs of consecutive words: "I love", "love you", "you EOS".
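Again, a small sketch of what those word-ngram tokens look like (fastText actually hashes them into buckets rather than storing the strings; "EOS" here stands for the end-of-sentence token):

```python
def word_ngrams(words, n=2):
    """Consecutive word n-grams over the word sequence, EOS included (illustration only)."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(word_ngrams(["I", "love", "you", "EOS"]))  # ['I love', 'love you', 'you EOS']
```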
Once we have the tokens:
In order to compute the hidden layer, the embedding of the sentence will be the average of the embeddings of the individual tokens above. This is done by summing the corresponding column vectors of dimension 300 in the input matrix and dividing by the number of tokens to obtain the average.
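As for the final layer you asked about, a rough numpy sketch of the forward pass might look like the following (the sizes, token ids and the plain softmax here are assumptions for illustration; the real logic is in fastText's C++ code):

```python
import numpy as np

dim, n_labels, table_size = 300, 4, 10_000        # assumed sizes, for illustration only
input_matrix = np.random.rand(table_size, dim)    # one 300-d vector per token (words, subwords, word n-grams)
output_matrix = np.random.rand(dim, n_labels)     # the [300, #labels] layer from your question

token_ids = [3, 17, 42, 103]                      # hypothetical row indices of the tokens listed above
hidden = input_matrix[token_ids].sum(axis=0) / len(token_ids)  # sum the token vectors, divide by their count

scores = hidden @ output_matrix                   # one score per label
probs = np.exp(scores) / np.exp(scores).sum()     # softmax (loss='softmax' in your call)
print(probs)
```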