I need to detect the language of text sent in chat, and I am facing two problems: noise and very short messages.
For the noise, I clean the message and that works fine, but the length of the message is still a problem.
For example, if a user writes "hi", fastText detects the language as German (`de`), but Google Translate detects it as English, and the message is most likely English.
I am trying to train my own fastText model, but how can I tune it to get better results on short strings? Do I need to train it on dictionaries from many languages to get a better result?
I use fastText because, as far as I know, it is the most accurate language detector.
Here is an example of the problem with fastText:
# wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
import fasttext
text = "Hi"
pretrained_lang_model = "lid.176.bin"
model = fasttext.load_model(pretrained_lang_model)
predictions = model.predict(text, k=2)
print(predictions)
# (('__label__de', '__label__en'), array([0.51606238, 0.31865335]))
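The result is a pair of labels and probabilities, which is awkward to compare with other detectors. A minimal sketch of a helper that turns it into a plain dictionary follows. The name `predictions_to_dict` is hypothetical (it is not part of fastText), and the example input is the pair printed above, written with a plain list so the snippet runs without the model file.

```python
def predictions_to_dict(predictions, prefix="__label__"):
    """Turn a (labels, probabilities) pair into {language_code: probability}."""
    labels, probs = predictions
    return {
        # Strip the "__label__" prefix that fastText puts in front of each code.
        (label[len(prefix):] if label.startswith(prefix) else label): float(p)
        for label, p in zip(labels, probs)
    }


# Same values as the output above; a list stands in for the NumPy array.
example = (("__label__de", "__label__en"), [0.51606238, 0.31865335])
scores = predictions_to_dict(example)
print(scores)
```

With the real model, the call would be `predictions_to_dict(model.predict(text, k=2))`.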
I have found a way to get better results. If you sum the per-language probabilities from several detectors, such as fastText and lingua, and add dictionary-based detection for short texts, you can get very good results. For my task, I also trained a fastText model on my own data.
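Here is a rough sketch of that idea, not my exact code. Everything named here is an assumption for illustration: `detect_language`, the tiny `SHORT_WORDS` list, the `max_short_tokens` threshold, the dictionary weight of 1.0, and the two stand-in detectors with made-up scores. In practice each detector would be a small wrapper around fastText or lingua that returns `{language_code: probability}` for an already cleaned message.

```python
from collections import defaultdict

# Hypothetical word list for very short messages; a real one would be much larger.
SHORT_WORDS = {
    "hi": "en",
    "hello": "en",
    "hallo": "de",
    "salut": "fr",
    "hola": "es",
}


def detect_language(text, detectors, short_words=SHORT_WORDS, max_short_tokens=2):
    """Sum per-language scores from all detectors, plus a dictionary vote for short texts.

    Each detector is a callable taking a string and returning {language: probability}.
    Returns (best_language, summed_scores), or (None, {}) if nothing could be scored.
    """
    cleaned = text.strip().lower()
    tokens = cleaned.split()
    if not tokens:
        return None, {}

    totals = defaultdict(float)
    for detect in detectors:
        for lang, prob in detect(cleaned).items():
            totals[lang] += prob

    # For short messages, every dictionary hit adds a full vote (assumed weight 1.0).
    if len(tokens) <= max_short_tokens:
        for token in tokens:
            lang = short_words.get(token)
            if lang is not None:
                totals[lang] += 1.0

    if not totals:
        return None, {}
    best = max(totals, key=totals.get)
    return best, dict(totals)


# Stand-in detectors with made-up scores, so the sketch runs without any model.
def fake_fasttext(text):
    return {"de": 0.52, "en": 0.32}


def fake_lingua(text):
    return {"en": 0.45, "de": 0.30}


language, scores = detect_language("Hi", [fake_fasttext, fake_lingua])
print(language, scores)
```

With these made-up numbers the detectors alone would favour `de`, and the dictionary hit on "hi" is what tips the result to `en`. The weight and the token threshold are things to tune on your own data.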