I have 26,000,000 tweets and am creating a Gensim Dictionary with them. The first approach:
from gensim.corpora import Dictionary
import json

corpus = []
with open(input_file, "r") as file:
    for row in file:
        comment = json.loads(row)
        corpus.append(comment['text'].split())

gdict = Dictionary(corpus)
The final line alone takes about an hour. I have also tried the following:
from gensim.corpora import Dictionary
import json

gdict = Dictionary()
with open(input_file, "r") as file:
    for row in file:
        comment = json.loads(row)
        gdict.add_documents([comment['text'].split()])
This processes the first 11,400,000 tweets in about 2 minutes, then suddenly slows down, and the estimated time to finish jumps to over 3,000 hours.
I have randomised the order of the tweets and the same happens after around the same number of tweets, so it isn't a particular tweet doing it.
I have also grouped the tweets into various batch sizes before calling add_documents, and the same issue happens for all of them, at around the same stage.
With the first approach, the final gdict is only about 1.4 MB, so it isn't a size issue.
Any ideas?
Though the questions I asked in a comment are all relevant for diagnosing a general slowdown in this sort of process, looking at the gensim.corpora.Dictionary source, I see another likely culprit for the problem: a rather inefficient and oft-repeated 'pruning' once the Dictionary has 2M entries.
If you have sufficient RAM, supplying a much larger prune_at parameter to either the constructor (when you're also passing the corpus in) or the .add_documents() call (when you're using that) should forestall or eliminate the issue. Ideally, you'd pick a prune_at value larger than the Dictionary ever becomes, so pruning never happens, though that risks exhausting memory if your corpus has more unique words than can be tracked in RAM.
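For example, here's a minimal sketch of your streaming approach with pruning effectively disabled. The batching and the 100,000 batch size are just illustrative choices, and prune_at=None (which skips pruning altogether) can be swapped for any large threshold you're comfortable holding in RAM:

from gensim.corpora import Dictionary
import json

# Option 1: pass a larger prune_at together with the corpus to the constructor:
#     gdict = Dictionary(corpus, prune_at=20_000_000)

# Option 2: pass prune_at (or None, to disable pruning) on each add_documents() call:
gdict = Dictionary()
batch = []
with open(input_file, "r") as file:
    for row in file:
        batch.append(json.loads(row)['text'].split())
        if len(batch) >= 100_000:                      # illustrative batch size
            gdict.add_documents(batch, prune_at=None)  # or e.g. prune_at=20_000_000
            batch = []
if batch:
    gdict.add_documents(batch, prune_at=None)

With prune_at=None the Dictionary keeps every unique token it sees, so watch memory usage if your vocabulary is very large.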
(Side note: if your comment['text'] fields haven't been preprocessed in any way, then a simple .split() tokenization will likely leave you with lots of extra tokens, like words connected to surrounding punctuation or with varying capitalization, which would inflate the token count a lot and perhaps be less useful for many downstream tasks.)