text-classification, supervised-learning, fasttext

What is the difference between args wordNgrams, minn and maxn in fastText supervised learning?


I'm a little confused after reading "Bag of Tricks for Efficient Text Classification". What is the difference between the args wordNgrams, minn and maxn?

For example, take a text classification task with GloVe embeddings as pretrainedVectors:

import fasttext as ft
model = ft.train_supervised(file_path, lr=0.1, epoch=5, wordNgrams=2, dim=300, loss='softmax', minn=2, maxn=3, pretrainedVectors='glove.300d.txt', verbose=0)

An input sentence is 'I love you'. Given minn=2, maxn=3, the whole sentence is transformed into [<I, I>], [<l, <lo, lo, lov, .....] etc. For the word love, its fastText embedding = (emb(love) (as a complete word) + emb(<l) + emb(<lo) + ....) / n. The sentence is split into [I love, love you] (because wordNgrams=2), and these 2-gram embeddings are [(fastText emb(I) + fastText emb(love))/2, (fastText emb(love) + fastText emb(you))/2]. The sentence embedding is the average of the 2-gram embeddings and has dimensionality 300. It is then fed through a layer with #labels neurons (i.e. multiplied by a matrix of size [300, #labels]).

Is this right? Please correct me if I'm wrong.


Solution

  • raju,

    You are almost right, but the averaging happens at the very end.

    First, how is a sentence tokenized?

    The whole sentence is tokenized on spaces. So "I love you" produces the 3 words "I", "love", "you" plus a special EOS (end of sentence) token; so far we have 4 tokens. Then, for each word, depending on what you set for minn and maxn, fastText computes the subwords and considers them as tokens as well. So in your case with minn=2, maxn=3, these will be: "<I", "<I>", "I>", "<l", "<lo", "lo", "lov", "ov", "ove", "ve", "ve>", "e>", "<y", "<yo", "yo", "you", "ou", "ou>", "u>" (beginning-of-word and end-of-word characters, < and >, are added as well).
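    As a rough illustration of that subword rule, here is a minimal Python sketch (not fastText's actual code; char_ngrams is just a hypothetical helper name for this answer):

        def char_ngrams(word, minn, maxn):
            # Pad the word with the beginning/end-of-word markers fastText adds.
            padded = "<" + word + ">"
            grams = []
            for n in range(minn, maxn + 1):              # n-gram lengths minn..maxn
                for i in range(len(padded) - n + 1):
                    grams.append(padded[i:i + n])
            return grams

        print(char_ngrams("love", 2, 3))
        # ['<l', 'lo', 'ov', 've', 'e>', '<lo', 'lov', 'ove', 've>']
        print(char_ngrams("I", 2, 3))
        # ['<I', 'I>', '<I>']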

    So the overall tokens will be "I", "love", "you", EOS, "<I", "<I>", "I>", "<l", "<lo", "lo", "lov", "ov", "ove", "ve", "ve>", "e>", "<y", "<yo", "yo", "you", "ou", "ou>", "u>".

    Now, with wordNgrams=2, we also add tokens corresponding to pairs of consecutive words: "I love", "love you", "you EOS".
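    To make that word-ngram step concrete, a small sketch (again only an illustration; fastText actually hashes these word n-grams into buckets rather than storing the strings):

        tokens = ["I", "love", "you", "</s>"]   # "</s>" stands in for the EOS token here
        word_bigrams = [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
        print(word_bigrams)
        # ['I love', 'love you', 'you </s>']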

    Once we have the tokens:

    To compute the hidden layer, the embedding of the sentence is the average of the embeddings of the individual tokens above. This is done by summing the corresponding 300-dimensional vectors of the input matrix and dividing by the number of tokens to get the average (a single line in the fastText source).
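    A minimal numpy sketch of that averaging plus the softmax layer described in the question, with random matrices standing in for the trained input and output matrices (the sizes are illustrative: 26 counts each token listed above once, and n_labels=5 is just an example):

        import numpy as np

        n_tokens, dim, n_labels = 26, 300, 5          # words + subwords + word bigrams from the example
        token_vecs = np.random.rand(n_tokens, dim)    # vectors from the input matrix for those tokens
        hidden = token_vecs.sum(axis=0) / n_tokens    # sentence embedding = average of token vectors
        output_matrix = np.random.rand(dim, n_labels) # shape [300, #labels]
        scores = hidden @ output_matrix               # [300] x [300, #labels] -> [#labels]
        probs = np.exp(scores) / np.exp(scores).sum() # softmax over the labels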