python · nlp · gensim · word2vec · word-embedding

Sentence iterator to pass to Gensim language model


I am relatively new to NLP and I am trying to create my own word embeddings, trained on my personal corpus of documents.

I am trying to use the following code to create my own word embeddings:

model = gensim.models.Word2Vec(sentences)

with sentences being a list of sentences. Since I cannot pass thousands and thousands of sentences at once, I need an iterator:

# minibatch_dir is a directory containing the text files
# MySentences is a class that iterates over sentences
sentences = MySentences(minibatch_dir)  # a memory-friendly iterator

I found this solution by the creator of gensim:

import os

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()

It does not work for me. How can I create an iterator if I know how to get the list of sentences from every document?

And a second, closely related question: if I am aiming to compare document similarity within a particular corpus, is it always better to train word embeddings from scratch on all the documents of that corpus than to use pretrained GloVe or word2vec vectors? The corpus contains around 40,000 documents.

cheers



Solution

  • The MySentences class you show assumes one sentence per line. That might not be the case for your data.

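    If your files contain running text rather than one sentence per line, you can first split each file into sentences with a sentence tokenizer. A minimal sketch, assuming NLTK is installed and its punkt data has been downloaded (the class name is illustrative, not from the original):

    import os
    from nltk.tokenize import sent_tokenize

    class MyTokenizedSentences(object):
        def __init__(self, dirname):
            self.dirname = dirname

        def __iter__(self):
            for fname in os.listdir(self.dirname):
                with open(os.path.join(self.dirname, fname)) as f:
                    # split the whole file into sentences, then into word tokens
                    for sentence in sent_tokenize(f.read()):
                        yield sentence.split()
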
    One thing to note: calling Word2Vec(sentences, iter=1) will run two passes over the sentences iterator (or, in general, iter+1 passes; the default is iter=5). The first pass collects words and their frequencies to build an internal dictionary tree structure. The second and subsequent passes train the neural model. These two (or iter+1) passes can also be initiated manually, in case your input stream is non-repeatable (you can only afford one pass) and you are able to initialize the vocabulary some other way:

    model = gensim.models.Word2Vec(iter=1)  # an empty model, no training yet
    model.build_vocab(some_sentences)  # can be a non-repeatable, 1-pass generator
    model.train(other_sentences)  # can be a non-repeatable, 1-pass generator
    
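    As a concrete sketch (not from the original answer), the same two-pass flow with the directory iterator from the question might look like this in a recent Gensim (4.x), where the parameter is named epochs rather than iter and train() needs explicit counts:

    import gensim

    sentences = MySentences(minibatch_dir)            # repeatable, streams from disk
    model = gensim.models.Word2Vec(epochs=1)          # empty model, no training yet
    model.build_vocab(sentences)                      # first pass: collect the vocabulary
    model.train(sentences,
                total_examples=model.corpus_count,    # set by build_vocab()
                epochs=model.epochs)                  # second pass: train the model
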

    For example, if you try to read a dataset stored in a database and stream the text directly from it with a generator function, Word2Vec will throw a TypeError:

    TypeError: You can't pass a generator as the sentences argument. Try an iterator.
    
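    A throwaway generator is enough to reproduce the failure mode (hypothetical sketch, with a Gensim version that matches the message above):

    import gensim

    def gen():
        yield ["a", "sentence"]

    # passing the generator object directly raises the TypeError above
    model = gensim.models.Word2Vec(gen())
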

    A generator can be consumed only once and is then exhausted. So you can write a wrapper that exposes an iterator interface but uses the generator function under the hood:

    class SentencesIterator:
        def __init__(self, generator_function):
            # keep the generator *function* so a fresh generator can be created on demand
            self.generator_function = generator_function
            self.generator = self.generator_function()

        def __iter__(self):
            # reset by recreating the generator, so the corpus can be streamed again
            self.generator = self.generator_function()
            return self

        def __next__(self):
            # StopIteration from the exhausted generator ends the current pass
            return next(self.generator)
    

    The generator function is stored as well, so the iterator can reset itself; it can then be used with Gensim like this:

    from gensim.models import FastText
    
    sentences = SentencesIterator(tokens_generator)
    model = FastText(sentences)
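
    The snippet above assumes a tokens_generator already exists. One possibility, reusing minibatch_dir from the question, is a generator function that streams whitespace-split lines from each file (illustrative only, not part of the original answer):

    import os

    def tokens_generator():
        for fname in os.listdir(minibatch_dir):
            with open(os.path.join(minibatch_dir, fname)) as f:
                for line in f:
                    yield line.split()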