Tags: python, gensim, word2vec

Embedding multiword ngram phrases with PathLineSentences in gensim word2vec


I have around 82 gzipped files (around 180MB each, 14GB total), where each file contains newline-separated sentences. I am thinking of using PathLineSentences from gensim's word2vec module to train a Word2Vec model on this corpus. That way I do not have to worry about loading the full list of sentences into RAM.
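Roughly, the streaming part I have in mind looks like this (just a sketch; I'm assuming all the gzipped files sit in a single directory, say ./corpus_gz/, and that PathLineSentences can read .gz files directly since it uses smart_open under the hood):

from gensim.models.word2vec import Word2Vec, PathLineSentences

# Stream whitespace-tokenized sentences line by line from every file in the
# directory, so the full 14GB never has to be loaded into RAM at once.
streamed_sentences = PathLineSentences('./corpus_gz/')
model = Word2Vec(streamed_sentences, min_count=5, workers=4)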

Now I also want the embeddings to include multiword phrases. But from the documentation, it seems that I need a phrase detector that has already been trained on all the sentences I have, e.g.

from gensim.models import Word2Vec, Phrases
# Train a bigram detector.
bigram_transformer = Phrases(all_sentences)
# Apply the trained MWE detector to a corpus, using the result to train a Word2Vec model.
model = Word2Vec(bigram_transformer[all_sentences], min_count=1)

Now, I have two questions:

  1. Is there any way I can do the phrase detection while running Word2Vec on top of each of the individual files, in a streaming manner?
  2. If not, is there any way I can do the initial phrase detection in a similar fashion to PathLineSentences, i.e. in a streaming manner?

Solution

  • The Gensim Phrases class will accept data in the exact same form as Word2Vec: an iterable of all the tokenized texts.

    You can provide such an iterable first as the initial training corpus, and then again as the corpus to be transformed into paired bigrams.
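    As a minimal sketch (assuming your 82 gzipped files live in a single directory, here ./corpus_gz/):

    from gensim.models import Phrases
    from gensim.models.word2vec import PathLineSentences

    # A restartable, streamed iterable of tokenized sentences; Phrases consumes
    # it in the same form Word2Vec would, without loading everything into RAM.
    all_sentences = PathLineSentences('./corpus_gz/')

    # Train the bigram detector over the stream.
    bigram_transformer = Phrases(all_sentences)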

    However, I would strongly suggest that you not do the phrase-combination as a simultaneous stream while feeding Word2Vec, for both clarity and efficiency reasons.

    Instead, do the transformation once, writing the results to a new, single corpus file. Then:

    • you can easily review the results of the bigram-combinations
    • the pair-by-pair calculations that decide which words will be combined will be done only once, creating a simple corpus of space-delimited tokens. (Otherwise, each of the epochs + 1 passes done by Word2Vec will need to repeat the same calculations.)

    Roughly that'd look like:

    # Write one space-delimited, phrase-combined sentence per line.
    with open('corpus.txt', 'w') as of:
        for phrased_sentence in bigram_transformer[all_sentences]:
            of.write(' '.join(phrased_sentence))
            of.write('\n')
    

    (You could instead write to a gzipped file like corpus.txt.gz, using GzipFile or smart_open's gzip functionality, if you'd like.)
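    A rough sketch of that gzipped variant, assuming smart_open (which infers the gzip codec from the .gz extension):

    from smart_open import open as sopen

    # Same loop as above, but written through smart_open so the output file
    # is gzip-compressed based on its .gz extension.
    with sopen('corpus.txt.gz', 'w', encoding='utf-8') as of:
        for phrased_sentence in bigram_transformer[all_sentences]:
            of.write(' '.join(phrased_sentence))
            of.write('\n')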

    Then the new file shows you the exact data Word2Vec is operating on, and it can be fed as a simple corpus: wrapped as an iterable with LineSentence, or even passed via the corpus_file option, which can make better use of multiple worker threads.
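    A sketch of those two options (the parameter values are just placeholders):

    from gensim.models.word2vec import Word2Vec, LineSentence

    # Option 1: wrap the plain-text file as a restartable iterable of sentences.
    model = Word2Vec(LineSentence('corpus.txt'), min_count=5, workers=4)

    # Option 2: pass the path via corpus_file, which lets multiple worker
    # threads read different parts of the file in parallel.
    model = Word2Vec(corpus_file='corpus.txt', min_count=5, workers=8)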