I am using a fasttext model to predict labels for text.
Normally fasttext classifies text at the word level, for example:
import fasttext

model = fasttext.train_supervised(input="training_fasttextFormat.csv", lr=0.1, epoch=50, loss='hs', wordNgrams=2, dim=200)
print(model.test('testing_fasttextFormat.csv'))
But the parameter list at https://fasttext.cc/docs/en/options.html suggests that it can work at the character level as well:
The following arguments for the dictionary are optional:
-minCount minimal number of word occurrences [1]
-minCountLabel minimal number of label occurrences [0]
-wordNgrams max length of word ngram [1]
-bucket number of buckets [2000000]
-minn min length of char ngram [0]
-maxn max length of char ngram [0]
-t sampling threshold [0.0001]
-label labels prefix [__label__]
But I am not sure how to use these parameters to run fasttext at the character level. Could anyone provide an example?
If you're referring to the minn & maxn parameters, in the classic non-classification (not supervised) FastText modes, those control FastText's main difference with original word2vec: learning vectors for word-fragments, in addition to full-word vectors.
Such word-fragment vectors can then be used to synthesize word-vectors for words that weren't seen during training – "out of vocabulary" (or "oov") words. These synthesized vectors often work fairly well, or at least better than nothing, especially for things like typos or words where word-roots hint strongly at meaning.
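To make "word-fragments" concrete, here is a minimal sketch of how character n-grams between minn and maxn could be enumerated for one word. It is a simplification, not FastText's actual code: the function name char_ngrams is made up for illustration, and the real implementation also hashes each n-gram into one of the -bucket slots rather than keeping the strings.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Enumerate character n-grams of `word` (illustrative helper, not a fasttext API).

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    are distinguishable from fragments in the middle of a word.
    """
    if maxn <= 0:
        # minn = maxn = 0 means "no subword information at all".
        return []
    wrapped = f"<{word}>"
    ngrams = []
    for start in range(len(wrapped)):
        for n in range(max(minn, 1), maxn + 1):
            end = start + n
            if end > len(wrapped):
                break
            ngrams.append(wrapped[start:end])
    return ngrams


print(char_ngrams("where", minn=3, maxn=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

A misspelling such as "wherre" shares several of these fragments with "where", which is why a vector synthesized from fragments can still land close to the intended word.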
I suspect the excerpt you've quoted only shows 0 as the defaults for minn and maxn in supervised mode, & you'd see other defaults if executing fasttext skipgram (etc) without arguments. (Actually setting these parameters to 0 makes FastText for word-modeling essentially plain word2vec.)
That the supervised mode seems to default these to 0 may imply the creators of FastText didn't think, or find, the subword vectors to be as useful in the classification case.
But, you could certainly try setting them to other values, and checking if they improve your classification results over the defaults.
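As a rough sketch of what "setting them to other values" might look like with the Python bindings: train_supervised accepts minn and maxn as keyword arguments, so they can be passed alongside the settings from the question. The values 2 and 5 below are arbitrary starting points rather than recommendations, and the helper names build_params and train_with_char_ngrams are invented for this example.

```python
# Settings copied from the question.
BASE_PARAMS = {"lr": 0.1, "epoch": 50, "loss": "hs", "wordNgrams": 2, "dim": 200}


def build_params(minn=2, maxn=5, **overrides):
    """Merge the question's settings with character n-gram lengths.

    minn=2 / maxn=5 are assumed trial values; tune them on held-out data.
    """
    params = dict(BASE_PARAMS)
    params.update(minn=minn, maxn=maxn)
    params.update(overrides)
    return params


def train_with_char_ngrams(train_path, **kwargs):
    """Train a supervised model with subword (character n-gram) features enabled."""
    # Imported here so build_params can be inspected without fasttext installed.
    import fasttext

    return fasttext.train_supervised(input=train_path, **build_params(**kwargs))
```

Training once with train_with_char_ngrams("training_fasttextFormat.csv") and once with minn=0, maxn=0, then calling model.test('testing_fasttextFormat.csv') on each, gives a direct comparison of whether the subword features help on your data.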