I am relatively new to NLP and I am trying to create my own word embeddings, trained on a personal corpus of documents.
I am trying to implement the following code to create my own word embeddings:
model = gensim.models.Word2Vec(sentences)
with sentences being a list of sentences. Since I cannot pass thousands and thousands of sentences at once, I need an iterator:
# minibatch_dir is a directory containing the text files;
# MySentences is a class that iterates over sentences.
sentences = MySentences(minibatch_dir)  # a memory-friendly iterator
I found this solution by the creator of gensim:
import os

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()
It does not work for me. How can I create an iterator if I already know how to get the list of sentences from every document?
And a second, closely related question: if I am aiming to compare document similarity within a particular corpus, is it always better to train word embeddings from scratch on all the documents of that corpus than to use pretrained GloVe or word2vec vectors? The corpus has around 40,000 documents.
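Since you say you already know how to extract sentences from a single document, one way is to wrap that logic in a restartable iterable. This is only a sketch: `get_sentences` is a hypothetical placeholder for whatever per-document extraction you already have, and `CorpusSentences` just loops it over every file in a directory:

```python
import os

def get_sentences(path):
    # hypothetical helper: replace the body with however you already
    # extract tokenized sentences from one document
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            if tokens:
                yield tokens

class CorpusSentences:
    """Restartable iterable: each call to __iter__ starts a fresh pass,
    so gensim can make its multiple passes over the corpus."""
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in sorted(os.listdir(self.dirname)):
            yield from get_sentences(os.path.join(self.dirname, fname))
```

Because `__iter__` is a generator method, every `for` loop (and every gensim training pass) restarts from the first file, while only one line is held in memory at a time.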
cheers
Your illustrated MySentences class assumes one sentence per line. That might not be the case for your data.
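If your files contain whole paragraphs rather than one sentence per line, you need to split the text into sentences first. A minimal sketch (the regex is a naive assumption; a real sentence tokenizer such as NLTK's `sent_tokenize` handles abbreviations, quotes, etc. far better):

```python
import re

def split_sentences(text):
    # naive sketch: split on sentence-final punctuation followed by
    # whitespace, then whitespace-tokenize each sentence
    return [s.split() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
```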
One thing to note: calling Word2Vec(sentences, iter=1) will run two passes over the sentences iterator (in general, iter+1 passes; the default is iter=5; in gensim ≥ 4 the parameter is called epochs). The first pass collects the words and their frequencies to build an internal dictionary tree structure. The second and subsequent passes train the neural model. These two (or iter+1) passes can also be initiated manually, in case your input stream is non-repeatable (you can only afford one pass) and you are able to initialize the vocabulary some other way:
model = gensim.models.Word2Vec(iter=1)  # an empty model, no training yet
model.build_vocab(some_sentences)  # can be a non-repeatable, one-pass generator
model.train(other_sentences)  # can be a non-repeatable, one-pass generator
(Note that in recent gensim versions, model.train also requires explicit total_examples and epochs arguments, e.g. model.train(other_sentences, total_examples=model.corpus_count, epochs=model.epochs).)
For example, if your dataset is stored in a database and you write a generator function to stream the text directly from it, passing that generator as sentences will throw a TypeError:
TypeError: You can't pass a generator as the sentences argument. Try an iterator.
A generator can be consumed only once, and then it is exhausted. So you can write a wrapper that exposes an iterator interface but uses the generator under the hood, rebuilding it whenever a new pass starts.
class SentencesIterator:
    def __init__(self, generator_function):
        self.generator_function = generator_function
        self.generator = self.generator_function()

    def __iter__(self):
        # reset the generator for a fresh pass
        self.generator = self.generator_function()
        return self

    def __next__(self):
        # the exhausted generator's StopIteration propagates naturally
        return next(self.generator)
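You can check the reset behavior with a plain generator before wiring in a real database. A self-contained sketch (a compact copy of the wrapper is repeated here so the snippet runs on its own; tokens_generator is a hypothetical stand-in for your database stream):

```python
class SentencesIterator:
    def __init__(self, generator_function):
        self.generator_function = generator_function
        self.generator = self.generator_function()

    def __iter__(self):
        self.generator = self.generator_function()  # fresh pass
        return self

    def __next__(self):
        return next(self.generator)

def tokens_generator():
    # hypothetical stand-in for text streamed from a database
    yield ["hello", "world"]
    yield ["more", "tokens"]

sentences = SentencesIterator(tokens_generator)
first_pass = list(sentences)
second_pass = list(sentences)  # works: __iter__ rebuilt the generator
```

A bare generator would give an empty list on the second pass; the wrapper makes both passes identical, which is exactly what gensim's multi-pass training needs.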
The generator function is stored as well, so the wrapper can reset itself, and it can be used in gensim like this:
from gensim.models import FastText
sentences = SentencesIterator(tokens_generator)
model = FastText(sentences)