I have 26,000,000 tweets and am creating a Gensim Dictionary with them. The first approach:
from gensim.corpora import Dictionary
import json

corpus = []
with open(input_file, "r") as file:
    for row in file:
        comment = json.loads(row)
        corpus.append(comment['text'].split())

gdict = Dictionary(corpus)
The final line alone takes about an hour. I have also tried the following:
from gensim.corpora import Dictionary
import json

gdict = Dictionary()
with open(input_file, "r") as file:
    for row in file:
        comment = json.loads(row)
        gdict.add_documents([comment['text'].split()])
This processes the first 11,400,000 tweets in about 2 minutes, then suddenly slows down, and the estimated time to finish jumps to over 3,000 hours.
I have randomised the order of the tweets and the same happens after around the same number of tweets, so it isn't a particular tweet doing it.
I have also grouped the tweets into various batch sizes before calling add_documents, and the same issue happens for all of them, at around the same stage.
With the first approach, the final gdict is only about 1.4 MB, so it isn't a size issue.
Any ideas?
Though the questions I asked in a comment are all relevant for diagnosing a general slowdown in this sort of process, looking at the gensim.corpora.Dictionary source, I see another likely culprit for the problem: a rather inefficient and oft-repeated 'pruning' once the Dictionary has 2M entries.
If you have sufficient RAM, supplying a much larger prune_at parameter to either the constructor (when you're also passing the corpus in) or the .add_documents() call (when you're using that) should forestall or eliminate the issue. Ideally, you'd pick a prune_at value larger than the Dictionary ever becomes, so pruning never happens, though that risks exhausting memory if your corpus has more unique words than can be tracked in RAM.
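For example, here's a minimal sketch of your streaming approach with pruning effectively disabled. The batching and the 100,000 batch size are just illustrative choices, and prune_at=None (which skips pruning altogether) can be swapped for any large threshold you're comfortable holding in RAM:

from gensim.corpora import Dictionary
import json

# Option 1: pass a larger prune_at together with the corpus to the constructor:
#     gdict = Dictionary(corpus, prune_at=20_000_000)

# Option 2: pass prune_at (or None, to disable pruning) on each add_documents() call:
gdict = Dictionary()
batch = []
with open(input_file, "r") as file:
    for row in file:
        batch.append(json.loads(row)['text'].split())
        if len(batch) >= 100_000:                      # illustrative batch size
            gdict.add_documents(batch, prune_at=None)  # or e.g. prune_at=20_000_000
            batch = []
if batch:
    gdict.add_documents(batch, prune_at=None)

With prune_at=None the Dictionary keeps every unique token it sees, so watch memory usage if your vocabulary is very large.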
(Side note: if your comment['text'] fields haven't been preprocessed in any way, then a simple .split() tokenization will likely leave you with lots of extra tokens, like words connected to surrounding punctuation or with varying capitalization, which would inflate the token count a lot and perhaps be less useful for many downstream tasks.)