
Can fasttext classify on character level?


I am using a fasttext model to predict labels for text.

Usually fasttext classifies text at the word level, for example:

import fasttext

model = fasttext.train_supervised(input="training_fasttextFormat.csv", lr=0.1, epoch=50, loss='hs', wordNgrams=2, dim=200)
print(model.test('testing_fasttextFormat.csv'))

But it seems from the parameter explanations at https://fasttext.cc/docs/en/options.html that it can work at the character level as well:

The following arguments for the dictionary are optional:
  -minCount           minimal number of word occurrences [1]
  -minCountLabel      minimal number of label occurrences [0]
  -wordNgrams         max length of word ngram [1]
  -bucket             number of buckets [2000000]
  -minn               min length of char ngram [0]
  -maxn               max length of char ngram [0]
  -t                  sampling threshold [0.0001]
  -label              labels prefix [__label__]

But I am not sure how to use these parameters to run fasttext at the character level. Could anyone give an example?


Solution

  • If you're referring to the minn & maxn parameters: in the classic non-classification (not supervised) FastText modes, those control FastText's main difference from the original word2vec, which is learning vectors for word fragments (character n-grams) in addition to full-word vectors.

    Such word-fragment vectors can then be used to synthesize word vectors for words that weren't seen during training – "out of vocabulary" (or "OOV") words. These synthesized vectors often work fairly well, or at least better than nothing, especially for typos or words whose roots hint strongly at their meaning (the second sketch at the end of this answer illustrates this).

    I suspect the excerpt you've quoted only shows 0 as the defaults for minn and maxn in supervised mode, and you'd see other defaults if executing fasttext skipgram (etc.) without arguments. (Setting these parameters to 0 makes FastText's word modeling essentially plain word2vec.)

    The fact that the supervised mode seems to default these to 0 may imply that the creators of FastText didn't think, or didn't find, the subword vectors to be as useful in the classification case.

    But you could certainly try setting them to other values and checking whether they improve your classification results over the defaults, as in the sketches below.
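
    For the supervised case, a minimal sketch would be passing nonzero minn/maxn to the same train_supervised call from the question. The specific values (2 and 5 here) are only illustrative starting points, not tuned recommendations:

    import fasttext

    # Same call as in the question, but with character n-gram lengths enabled
    model = fasttext.train_supervised(
        input="training_fasttextFormat.csv",
        lr=0.1,
        epoch=50,
        loss='hs',
        wordNgrams=2,
        dim=200,
        minn=2,   # minimum character n-gram length
        maxn=5,   # maximum character n-gram length
    )
    print(model.test("testing_fasttextFormat.csv"))

    Everything else can stay the same as in the question's word-level call.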
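
    And to illustrate the earlier point about subword vectors in the non-supervised modes, here is a sketch (the corpus file name and the misspelled query word are just placeholders) showing a skipgram model with character n-grams synthesizing a vector for a word it never saw:

    import fasttext

    # Unsupervised skipgram training with character n-grams of length 3-6
    wv_model = fasttext.train_unsupervised(
        input="corpus.txt",   # hypothetical plain-text corpus file
        model="skipgram",
        minn=3,
        maxn=6,
        dim=100,
    )

    # Because subword vectors were learned, a vector can still be synthesized
    # for a word that never appeared in the corpus (e.g. a typo)
    vec = wv_model.get_word_vector("misspeled")
    print(vec.shape)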