I need to train a fastText model on a 400GB corpus. As I don't have a machine with 400GB of RAM, I want to know whether the fastText implementation (for example, following this tutorial: https://fasttext.cc/docs/en/unsupervised-tutorial.html) supports corpora bigger than RAM, and what RAM requirements I would have.
Generally for such models, the peak RAM requirement is a function of the number of unique words in the vocabulary, rather than the total size of the raw training material.
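For fastText in particular, a rough back-of-envelope estimate is possible: the model keeps one float32 vector per vocabulary word plus one per subword hash bucket (input matrix), and one per word (output matrix). A minimal sketch, assuming the unsupervised tutorial's defaults (`dim=100`, `bucket=2000000`) and ignoring the smaller dictionary/hash-table overhead:

```python
def estimate_fasttext_ram_gb(vocab_size, dim=100, buckets=2_000_000):
    # Input matrix: one vector per word and per subword hash bucket.
    # Output matrix: one vector per word (negative-sampling output).
    # float32 = 4 bytes; dictionary overhead is not included here.
    input_floats = (vocab_size + buckets) * dim
    output_floats = vocab_size * dim
    return (input_floats + output_floats) * 4 / 1024**3

print(estimate_fasttext_ram_gb(100_000))     # ~0.8 GB
print(estimate_fasttext_ram_gb(50_000_000))  # ~38 GB
```

Plug in your real `dim` and vocabulary size; note that the corpus size itself never enters the formula.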
So, are there only 100k unique words in your 400GB? No problem: fastText streams the training file from disk, reading one range at a time and updating a small, stable amount of RAM. Are there 50M unique words? You'll need a lot of RAM.
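If you don't know your vocabulary size, you can count it with a single streaming pass that never loads the whole file. A minimal sketch (`corpus.txt` is a placeholder path, and `.split()` only approximates fastText's whitespace tokenization):

```python
from collections import Counter

# Stream the corpus line by line: only the token counter, not the
# 400GB file, is held in RAM. Beware that an exact Counter over tens
# of millions of distinct tokens is itself memory-hungry; a cardinality
# estimator (e.g. HyperLogLog) is cheaper if you only need the count.
counts = Counter()
with open("corpus.txt", encoding="utf-8", errors="replace") as f:
    for line in f:
        counts.update(line.split())

print(f"{len(counts):,} unique tokens")
# fastText's default minCount=5 drops rarer words, so the effective
# vocabulary is usually smaller than the raw unique-token count:
print(sum(1 for c in counts.values() if c >= 5), "tokens at minCount=5")
```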
Have you tried it to see what would happen?