python dictionary text progress-bar gensim

Add progress bar (verbose) when creating gensim dictionary


I want to create a gensim dictionary from the rows of a dataframe. Each entry of df.preprocessed_text is a list of words.

from gensim.models.phrases import Phrases, Phraser
from gensim.corpora.dictionary import Dictionary


def create_dict(df, bigram=True, min_occ_token=3):

    token_ = df.preprocessed_text.values
    if not bigram:
        return Dictionary(token_)
    
    bigram = Phrases(token_,
                     min_count=3,
                     threshold=1,
                     delimiter=b' ')

    bigram_phraser = Phraser(bigram)

    bigram_token = []
    for sent in token_:
        bigram_token.append(bigram_phraser[sent])
    
    dictionary = Dictionary(bigram_token)
    dictionary.filter_extremes(no_above=0.8, no_below=min_occ_token)
    dictionary.compactify() 
    
    return dictionary

I couldn't find a progress bar option for this, and callbacks don't seem to work here either. Since my corpus is huge, I would really appreciate a way to show progress. Is there one?


Solution

  • I'd recommend against changing prune_at for monitoring purposes, as it changes the behavior around which bigrams/words are remembered, possibly discarding many more than is strictly required for capping memory usage.

    Wrapping tqdm around the iterables used (including the token_ use in the Phrases constructor and the bigram_token use in the Dictionary constructor) should work.

    Alternatively, enabling logging at the INFO level (or a more verbose one) should produce log output that, while not as pretty or accurate as a progress bar, gives some indication of progress.

    Further, if, as the code shows, bigram_token exists only to feed the next Dictionary, it need not be built as a full in-memory list. You should be able to use layered iterators to transform the text and tally the Dictionary item by item. For example:

        from tqdm import tqdm
        # ...
        dictionary = Dictionary(tqdm(bigram_phraser[token_]))
        # ...
    

    (Also, if you're only using the Phraser once, you may not be getting any benefit from creating it at all - it's an optional memory optimization for when you want to keep applying the same phrase-creation operation without the full overhead of the original Phrases survey object. But if the Phrases is still in-scope, and all of it will be discarded immediately after this step, it might be just as fast to use the Phrases object directly without ever taking a detour to create the Phraser - so give that a try.)