I downloaded the latest wiki dump multi-stream bz2. I call the WikiCorpus class from gensim.corpora, and after 90000 documents the vocabulary reaches its cap (2000000 tokens). I got this in the terminal:
keeping 2000000 tokens which were in no less than 0 and no more than 580000 (=100.0%) documents
resulting dictionary: Dictionary(2000000 unique tokens: ['ability', 'able', 'abolish', 'abolition', 'about']...)
adding document #580000 to Dictionary(2000000 unique tokens: ['ability', 'able', 'abolish', 'abolition', 'about']...)
The WikiCorpus class keeps working until it reaches the end of the documents in my bz2. Is there a way to stop it, or to split the bz2 file into a smaller sample? Thanks for the help!
There's no specific parameter to cap the number of tokens. But when you use WikiCorpus.get_texts(), you don't have to read them all: you can stop at any time.
If, as suggested by another question of yours, you plan to use the article texts for Gensim Word2Vec (or a similar model), you don't need the constructor to do its own expensive full-scan vocabulary discovery. If you supply any dummy object (such as an empty dict) as the optional dictionary parameter, it'll skip this unnecessary step. For example:
wiki_corpus = WikiCorpus(filename, dictionary={})
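As a minimal sketch, assuming wiki_corpus was built as above, you can then stream articles from get_texts() and simply break out whenever you've seen enough (the 10000-article cutoff here is arbitrary):

for i, tokens in enumerate(wiki_corpus.get_texts()):
    # each 'tokens' is one article, already tokenized into a list of word strings
    # ... use the article's tokens here ...
    if i >= 9999:  # e.g. stop after the first 10000 articles
        break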
If you also want to use some truncated version of the full set of articles, I'd suggest manually iterating over just a subset of the articles. For example, if the subset will easily fit as a list in RAM, say 50000 articles, that's as simple as:
import itertools
subset_corpus = list(itertools.islice(wiki_corpus.get_texts(), 50000))
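If the end goal is Word2Vec, that in-memory list can be passed straight to the model, since a plain list can be iterated repeatedly (once for vocabulary discovery, then once per training epoch). A rough sketch, assuming Gensim 4.x (older releases name the vector_size parameter size), with arbitrary example parameters:

from gensim.models import Word2Vec

model = Word2Vec(sentences=subset_corpus, vector_size=100, window=5, min_count=5, workers=4)
model.save('wiki_subset.model')  # placeholder path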
If you want to create a subset larger than RAM, iterate over a set number of articles, writing their tokenized texts to a scratch text file, one article per line. Then use that file as your later input. (By spending the WikiCorpus extraction/tokenization effort only once, then reusing the file from disk, this can sometimes offer a performance boost even when it isn't strictly needed.)
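A rough sketch of that write-once-then-reuse pattern, again reusing wiki_corpus from above; the scratch filename and the 200000-article count are placeholders. Gensim's LineSentence reads back exactly this one-article-per-line, space-delimited format:

import itertools
from gensim.models.word2vec import LineSentence

# write each tokenized article as one space-separated line
with open('wiki_subset.txt', 'w', encoding='utf-8') as fout:
    for tokens in itertools.islice(wiki_corpus.get_texts(), 200000):
        fout.write(' '.join(tokens) + '\n')

# in later runs, stream the already-tokenized articles straight from disk
sentences = LineSentence('wiki_subset.txt')

Those sentences can then be fed to Word2Vec (or a similar model) without repeating the bz2 decompression and wiki-markup parsing.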