
Do documents without labels add information to Facebook's FastText supervised classifier?


I hope you guys are doing great.

I'm training a classifier with Facebook's FastText to determine whether a piece of text (a tweet) is talking about the economy or not. For this task I have about 2,200 tweets tagged as "economy" or "not_economy", but I also have almost a million unlabeled tweets.

From FastText's documentation I know the supervised input file should contain one tweet per line, each prefixed with a label of the form __label__economy or __label__not_economy.
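For concreteness, such an input file can be built like this (a sketch with made-up tweets and hypothetical file names):

```shell
# Build the fastText supervised input: one tweet per line, prefixed with its label.
printf 'stocks fell on new inflation data\n' > economy.txt
printf 'my cat slept through the whole movie\n' > not_economy.txt
sed 's/^/__label__economy /' economy.txt > tweets_input
sed 's/^/__label__not_economy /' not_economy.txt >> tweets_input
cat tweets_input
```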

The documentation doesn't mention adding unlabeled documents to the supervised input file, but since fastText is a word embedding model, it's supposed to pick up context information from the distribution of words in the text, so I figured that giving the model all this extra data should yield a better embedding representation of my vocabulary. For this reason I'm training the model (with fasttext supervised -input tweets_input -output tweets_model) with the untagged documents appended at the end of the input file. The thing is that these almost 1M tweets don't seem to be enhancing the model at all.

The other way I know to take advantage of this data is to train an unsupervised model and then use the sentence embeddings to train a classifier.
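That second route can be sketched with the command-line tool, assuming hypothetical file names (skipgram and print-sentence-vectors are standard fastText commands):

```shell
# Sketch: train word vectors on ALL tweets (labeled + unlabeled), then emit
# one sentence vector per labeled tweet for use with an external classifier.
fasttext skipgram -input all_tweets.txt -output embed_model -dim 100
fasttext print-sentence-vectors embed_model.bin < labeled_tweets.txt > sentence_vectors.txt
# sentence_vectors.txt now holds one 100-dimensional vector per input line;
# pair each vector with its label and train any classifier (e.g. logistic
# regression) on them.
```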

The question is the one in the title:

Do documents without labels add information to Facebook's FastText supervised classifier? Or is it better to get the document embeddings and train my own classifier with another library?

Thanks for any information that helps me understand this better.


Solution

  • You can't use untagged documents to train the supervised model, because they lack labels.

    You can try this idea:
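One common way to still fold the unlabeled tweets into the pipeline is to pretrain word vectors on the full corpus and pass them to the supervised trainer via fastText's -pretrainedVectors option. A minimal sketch, assuming hypothetical file names (the -dim value must match the dimension of the pretrained vectors):

```shell
# 1) Unsupervised pretraining on all ~1M tweets (labels stripped).
fasttext skipgram -input all_tweets.txt -output pretrained -dim 100
# 2) Supervised training on the ~2,200 labeled tweets, warm-started
#    from the pretrained word vectors.
fasttext supervised -input tweets_input -output tweets_model \
    -pretrainedVectors pretrained.vec -dim 100
```

This way the classifier's embeddings start from vectors that have seen the whole vocabulary in context, instead of only the small labeled set.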