I'd like to create a big gensim dictionary for the French language, to try to get better results in topic detection, similarities between texts and other things like that. So I've planned to use a Wikipedia dump and process it the following way:
Because of the very large size of the corpus, I don't store anything in memory and access the corpus via smart_open, but it appears the gensim Phrases model is consuming too much RAM to complete the third step.
Here is my sample code:
from smart_open import smart_open  # streams the file instead of loading it fully into memory
import gensim

corpus = smart_open(corpusFile, "r")
phrases = gensim.models.Phrases()

with smart_open(phrasesFile, "wb") as phrases_file:
    chunks_size = 10000
    texts, i = [], 0
    for text in corpus:
        texts.append(text.split())
        i += 1
        if i % chunks_size == 0:
            # feed the vocabulary in chunks of 10 000 articles
            phrases.add_vocab(texts)
            texts = []
    if texts:
        # don't forget the last, incomplete chunk
        phrases.add_vocab(texts)
    phrases.save(phrases_file)
corpus.close()
Is there a way to complete the operation without freezing my computer or will I have to train the Phrases model only on a subset of my corpus?
I'm answering myself because I realized I had forgotten to deal with some memory-related parameters in the Phrases class.
So, first I divided max_vocab_size by 2 so it should consume less memory, and I also decided to save the Phrases object every 100 000 articles and then reload it from the saved file, as this kind of trick has proven helpful with some other classes in the gensim lib...
Here is the new code, maybe a little slower, but it completed the task successfully:
from smart_open import smart_open
from gensim.models import Phrases

corpus = smart_open(corpusFile, "r")
max_vocab_size = 20000000  # half of the default (40 000 000), to cap memory usage
phrases = Phrases(max_vocab_size=max_vocab_size)

chunks_size = 10000
save_every = 100000
texts, i = [], 0
for text in corpus:
    texts.append(text.split())
    i += 1
    if i % chunks_size == 0:
        phrases.add_vocab(texts)
        texts = []
    if i % save_every == 0:
        # save and reload every 100 000 articles to keep memory usage down
        phrases.save(phrasesFile)
        phrases = Phrases.load(phrasesFile)
if texts:
    # flush the last, incomplete chunk
    phrases.add_vocab(texts)
corpus.close()
phrases.save(phrasesFile)
I ended up with 412,816 phrasegrams in my case after putting all of this into a Phraser object.
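For reference, that last step looks roughly like this. This is only a minimal sketch: the phrasesFile name here is made up (it stands for the file saved above), it uses the gensim 3.x Phraser class (in gensim 4+ Phraser is an alias for FrozenPhrases, which you can also obtain via phrases.freeze()), and the sample sentence is purely illustrative.

from gensim.models.phrases import Phrases, Phraser

phrasesFile = "frwiki_phrases.model"  # hypothetical path, stands for the file saved above
phrases = Phrases.load(phrasesFile)
bigram = Phraser(phrases)  # frozen model: much smaller and faster, but no further training possible

# toy tokenized sentence, just to show how the phrasegrams are applied
print(bigram["la tour eiffel se trouve a paris".split()])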