
Is there a way to stop creation of the vocabulary in gensim.WikiCorpus when it reaches 2,000,000 tokens?


I downloaded the latest multi-stream Wikipedia dump (bz2). I call the WikiCorpus class from gensim.corpora, and after 90,000 documents the vocabulary reaches its maximum size (2,000,000 tokens). I got this in the terminal:

    keeping 2000000 tokens which were in no less than 0 and no more than 580000 (=100.0%) documents
    resulting dictionary: Dictionary(2000000 unique tokens: ['ability', 'able', 'abolish', 'abolition', 'about']...)
    adding document #580000 to Dictionary(2000000 unique tokens: ['ability', 'able', 'abolish', 'abolition', 'about']...)

The WikiCorpus class keeps working until it reaches the end of the documents in my bz2 file. Is there a way to stop it, or to split the bz2 file into a smaller sample? Thanks for any help!


Solution

  • There's no specific parameter to cap the number of tokens. But when you use WikiCorpus.get_texts(), you don't have to read them all: you can stop at any time.

    If, as suggested by another question of yours, you plan to use the article texts for Gensim Word2Vec (or a similar model), you don't need the constructor to do its own expensive full-scan vocabulary-discovery. If you supply any dummy object (such as an empty dict) as the optional dictionary parameter, it'll skip this unnecessary step. For example:

    from gensim.corpora.wikicorpus import WikiCorpus
    wiki_corpus = WikiCorpus(filename, dictionary={})  # dummy dict skips the full vocabulary scan
    
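    Note that with a dummy dictionary the corpus can no longer be iterated directly for bag-of-words vectors (that path would call doc2bow() on the dummy object), so read the article token lists through get_texts() instead.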

    If you also want to use some truncated version of the full set of articles, I'd suggest manually iterating over just a subset of them. For example, if the subset will easily fit as a list in RAM, say 50,000 articles, that's as simple as:

    import itertools

    # materialize the tokenized texts of just the first 50,000 articles
    subset_corpus = list(itertools.islice(wiki_corpus.get_texts(), 50000))
    
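    Because the subset is a plain in-memory list, it can be iterated repeatedly, which Word2Vec needs for its vocabulary-building pass plus multiple training epochs; a one-shot generator would not work there.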

    If you want to create a subset larger than RAM, iterate over the set number of articles, writing their tokenized texts to a scratch text file, one article per line. Then use that file as your later input. (By spending the WikiCorpus extraction/tokenization effort only once, then reusing the file from disk, this can sometimes offer a performance boost even when you don't strictly need the subset.)
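
    A minimal sketch of that approach, reusing the wiki_corpus built with the dummy dictionary above (the wiki_subset.txt file name and the 50,000-article cutoff are arbitrary choices for illustration, not anything WikiCorpus requires):

    import itertools
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    # single pass over WikiCorpus: write one tokenized article per line
    with open('wiki_subset.txt', 'w', encoding='utf-8') as fout:
        for tokens in itertools.islice(wiki_corpus.get_texts(), 50000):
            fout.write(' '.join(tokens) + '\n')

    # later runs can stream the pre-tokenized file straight from disk,
    # skipping the expensive bz2 extraction/tokenization entirely
    model = Word2Vec(sentences=LineSentence('wiki_subset.txt'))

    LineSentence simply yields each line's whitespace-split tokens, so the saved file round-trips into the same token lists that get_texts() produced.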