I'm working on a new project in which I want to represent words as vectors. I read about the FastText library and saw that it offers pre-trained models for languages other than English. The purpose is to predict closeness between different words.
What I want to know is: can I train a FastText model on non-English data, such as articles from news sites, to achieve better results for specific genres like politics and current topics?
Thanks in advance!
Can I train it on non-English data sets?
Of course you can. FastText provides pre-trained models for 157 different languages on its website, and you can download them as well.
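As a minimal sketch of how you would use one of those downloads with the official fasttext Python bindings (the filename cc.fr.300.bin, the French Common Crawl vectors, is just an example; substitute whichever language you need):

import fasttext

# Load a pre-trained model downloaded from the FastText website.
# "cc.fr.300.bin" is an illustrative filename, not a fixed path.
model = fasttext.load_model("cc.fr.300.bin")

# Query the nearest neighbors of a word to inspect "closeness".
print(model.get_nearest_neighbors("politique"))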
How long does it take to train a model on 10 GB of text?
It depends on your system and the implementation. For example, on a Mac Pro with 16 GB of RAM, Facebook's implementation takes about 8-10 hours.
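If you do train from scratch on your own news corpus, a minimal sketch with the same fasttext bindings could look like this; the corpus path and hyperparameters below are illustrative assumptions, not tuned values:

import fasttext

# Train unsupervised word vectors on a plain-text corpus.
# "news_corpus.txt" and the hyperparameters are illustrative assumptions.
model = fasttext.train_unsupervised(
    "news_corpus.txt",
    model="skipgram",  # skip-gram tends to pair well with subword information
    dim=300,           # vector dimensionality
    epoch=5,
    thread=8,          # more threads shorten training time considerably
)
model.save_model("news_vectors.bin")

Increasing the thread count is the main lever for bringing that 8-10 hour figure down on a multi-core machine.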
Is it big enough?
If 10 GB is the file size after cleaning and preprocessing, then yes, that is fair enough.
Are there any better solutions?
What do you mean by better solutions? If I were in your shoes, I would try the pre-trained models first.
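Since your stated goal is predicting closeness between words, a quick way to evaluate a pre-trained model before committing to training your own is to compute cosine similarity between word vectors. This sketch reuses the same illustrative model file as above:

import fasttext
import numpy as np

model = fasttext.load_model("cc.fr.300.bin")  # illustrative filename

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the FastText vectors of two words."""
    va = model.get_word_vector(a)
    vb = model.get_word_vector(b)
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

print(cosine_similarity("politique", "gouvernement"))

If the pre-trained vectors already score sensibly on word pairs from your target domain, you may not need to train on your own corpus at all.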