python, python-3.x, gensim

'pseudocorpus' no longer available from 'gensim.models.phrases'?


Several months ago, I used "pseudocorpus" to create a fake corpus as part of phrase training using Gensim with the following code:

from gensim.models.phrases import pseudocorpus 

corpus = pseudocorpus(bigram_model.vocab, bigram_model.delimiter, bigram_model.common_terms)
bigrams = []
for bigram, score in bigram_model.export_phrases(corpus, bigram_model.delimiter, as_tuples=False):
    if score >= bigram_model.threshold:
        bigrams.append(bigram.decode('utf-8'))

Now when I run the code, I get the following error message:

ImportError: cannot import name 'pseudocorpus' from 'gensim.models.phrases'

I'm using Gensim 4.2.0. Is pseudocorpus() no longer available with Gensim 4.2.0?

Thanks a lot!


Solution

  • I believe the main internal consumer of a pseudocorpus() result, the .export_phrases() method, was reworked to achieve the same goal more efficiently, so the pseudocorpus() function was removed; it had never really been promoted as part of the module's public functionality.

    Can you make use of .export_phrases() for your purposes?

    If not, can you say a bit more about how you were using the (odd synthetic) 'pseudocorpus'?

    If all else fails, the prior functionality was a pretty simple extraction from the model's state, and you can view the last version of the function, before it was refactored away, in the project's open source repository:

    https://github.com/RaRe-Technologies/gensim/blob/da8847a04f9ee56702cb81a0218cd5a57e1f24e6/gensim/models/phrases.py#L750

    So, you could simply use that as a guide to reimplementing equivalent extraction in your own code.
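
For the .export_phrases() route, here is a minimal sketch. It assumes Gensim 4.x behaviour, where export_phrases() takes no arguments and returns a dict mapping each phrase (as str) to its score, already limited to phrases that pass the model's threshold. The helper name phrases_by_score is made up for illustration and is not part of Gensim.

```python
def phrases_by_score(phrase_model):
    """Return the model's phrases, highest-scoring first.

    Assumes a Gensim 4.x Phrases model, whose export_phrases() takes no
    arguments and returns a dict of {phrase (str): score (float)}.
    """
    scored = phrase_model.export_phrases()
    # Sort by descending score; break ties alphabetically for stable output.
    return sorted(scored, key=lambda phrase: (-scored[phrase], phrase))
```

With the trained model from the question, bigrams = phrases_by_score(bigram_model) should stand in for the whole loop. Under the assumptions above, neither the .decode('utf-8') call nor the explicit threshold check is needed any more.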