Tags: machine-learning, nlp, text-classification, fasttext

Training fastText models on user-generated social media content


I am currently learning about text classification using Facebook's fastText. I have found some data on Kaggle that contains characters such as ��, as well as Twitter usernames and hashtags. I tried searching the web, but there is no clear guidance on how you really need to clean/pre-process your text before training a model.

In some blog posts I've seen authors write about tokenisation, but it isn't mentioned in the fastText documentation. Another point is that the fastText GitHub repository has examples with clean data, such as Stack Overflow, but nothing for Twitter or similar platforms.

My question is: what is the best practice for pre-processing user-generated (social media) content before training a model? What should be removed?

Thanks


Solution

  • Since the fastText classifier trains its own embeddings (it does not depend on pretrained embeddings by default), you can pretty much choose your own way of cleaning your data. I would suggest you:

    • convert everything to lower case (or upper case if you prefer; it shouldn't matter);

    • remove special characters, except # and @.

      Everything else is up to you. You can decide to keep hashtags or remove them, and the same goes for usernames. I would probably remove usernames, because I suspect there isn't much information in them. In some cases they can be informative, though: think of tweets about, and replies to, Donald Trump, where his username is likely used a lot. Just try what works best for your case; fastText is very fast, so a few experiments won't be much of a problem. A rough sketch of such a cleaning step is shown below.
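
To make this concrete, here is a minimal Python sketch of such a cleaning step, followed by how the cleaned text could feed into fastText's supervised mode. The function name, regexes, and file/label names are illustrative assumptions, not something from the original answer:

```python
import re
import fasttext  # official Python bindings (pip install fasttext)

def clean_tweet(text, keep_hashtags=True, keep_usernames=False):
    """Minimal tweet-cleaning sketch for fastText training.

    Assumes usernames look like @user and hashtags like #tag;
    whether to keep either is left configurable, as suggested above.
    """
    text = text.lower()                            # normalise case
    if not keep_usernames:
        text = re.sub(r"@\w+", " ", text)          # drop @usernames
    if not keep_hashtags:
        text = re.sub(r"#\w+", " ", text)          # drop #hashtags
    # strip special characters except # and @ (so kept tokens survive)
    text = re.sub(r"[^a-z0-9#@\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()       # collapse whitespace

# fastText's supervised format: one line per example, label(s) first.
# "tweets.train" and the label here are made-up placeholders; in practice
# you would write your whole (cleaned) dataset to this file.
with open("tweets.train", "w", encoding="utf-8") as f:
    f.write("__label__positive " + clean_tweet("Loving it!! 😍 @someuser #happy") + "\n")

model = fasttext.train_supervised("tweets.train")
print(model.predict(clean_tweet("so #happy today!")))
```

Keeping # and @ out of the character filter is what lets the keep_hashtags/keep_usernames switches do their job; flipping those flags between runs is an easy way to carry out the comparison experiments suggested above.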