I want to create a gensim dictionary from the rows of a dataframe. Each entry of df.preprocessed_text is a list of words.

    from gensim.models.phrases import Phrases, Phraser
    from gensim.corpora.dictionary import Dictionary

    def create_dict(df, bigram=True, min_occ_token=3):
        token_ = df.preprocessed_text.values
        if not bigram:
            return Dictionary(token_)
        bigram = Phrases(token_,
                         min_count=3,
                         threshold=1,
                         delimiter=b' ')
        bigram_phraser = Phraser(bigram)
        bigram_token = []
        for sent in token_:
            bigram_token.append(bigram_phraser[sent])
        dictionary = Dictionary(bigram_token)
        dictionary.filter_extremes(no_above=0.8, no_below=min_occ_token)
        dictionary.compactify()
        return dictionary

I couldn't find a progress bar option for it, and the callbacks don't seem to work for it either. Since my corpus is huge, I would really appreciate a way to show progress. Is there one?
I'd recommend against changing prune_at for monitoring purposes, as it changes the behavior around which bigrams/words are remembered, possibly discarding many more than is strictly required for capping memory usage.
Wrapping tqdm around the iterables used (including the token_ use in the Phrases constructor and the bigram_token use in the Dictionary constructor) should work.
Alternatively, enabling logging at the INFO level (or a more verbose one) should display messages that, while not as pretty or accurate as a progress bar, will give some indication of progress.
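As a minimal sketch of the logging route, assuming gensim reports its progress through Python's standard logging module (the format string here is only an illustrative choice), run something like this before building the Phrases or Dictionary:

```python
import logging

# Print INFO-level (and more severe) log records with a timestamp.
# force=True (Python 3.8+) replaces any handlers that were configured
# earlier, so the call takes effect even in a notebook or a larger app.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
    force=True,
)
```

With this in place, periodic progress lines from the library should appear on the console while the corpus is being scanned.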
Further, if, as shown in the code, bigram_token is only used to build the subsequent Dictionary, it need not be created as a full in-memory list. You should be able to use layered iterators to transform the text and tally the Dictionary item by item. E.g.:

    # ...
    dictionary = Dictionary(tqdm(bigram_phraser[token_]))
    # ...

(Also, if you're only using the Phraser once, you may not be getting any benefit from creating it at all. It's an optional memory optimization for when you want to keep applying the same phrase-creation operation without the full overhead of the original Phrases survey object. But if the Phrases is still in scope, and all of it will be discarded immediately after this step, it might be just as fast to use the Phrases object directly without ever taking a detour to create the Phraser, so give that a try.)
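Putting the suggestions above together, here is a rough sketch rather than a definitive implementation. The function name bigram_dictionary and its token_lists parameter are hypothetical, the silent tqdm fallback is only there so the sketch runs without tqdm installed, and the delimiter argument is left at the library default. It wraps both passes in tqdm, skips the Phraser, and streams the transformed sentences into the Dictionary instead of building a list:

```python
from typing import List

try:
    from tqdm import tqdm
except ImportError:
    # Assumption for this sketch: without tqdm, just iterate silently.
    def tqdm(iterable, **kwargs):
        return iterable


def bigram_dictionary(token_lists: List[List[str]],
                      min_count: int = 3,
                      threshold: float = 1.0):
    """Hypothetical helper: build a bigram-aware Dictionary with progress bars."""
    from gensim.corpora.dictionary import Dictionary
    from gensim.models.phrases import Phrases

    # First pass: the Phrases constructor consumes the wrapped iterable once,
    # so the bar advances as each sentence is surveyed.
    phrases = Phrases(tqdm(token_lists, desc="Phrases survey"),
                      min_count=min_count,
                      threshold=threshold)

    # Second pass: apply the Phrases object directly (no Phraser) and stream
    # each transformed sentence into the Dictionary, one item at a time.
    streamed = (phrases[sent]
                for sent in tqdm(token_lists, desc="Dictionary tally"))
    return Dictionary(streamed)
```

The original filter_extremes and compactify calls can then be applied to the returned dictionary exactly as before.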