Tags: python, gensim, word2vec, word-embedding

word2vec/gensim — RuntimeError: you must first build vocabulary before training the model


I am having trouble training a word2vec model on my own .txt files.

The code:

import gensim
import json
import pandas as pd
import glob
import gensim.downloader as api
import matplotlib.pyplot as plt
from gensim.models import KeyedVectors


# loading the .txt files

sentences = []
sentence = []
for doc in glob.glob('./data/*.txt'):
    with open(doc, 'r') as f:
        for line in f:
            line = line.rstrip()
            if line == "":
                if len(sentence) > 0:
                    sentences.append(sentence)
                    sentence = []
            else:
                cols = line.split("\t")
                if len(cols) > 4:
                    form = cols[1]
                    lemma = cols[2]
                    pos = cols[3]
                    if pos != "PONCT":
                        sentence.append(form.lower())


# trying to train the model

from gensim.models import Word2Vec
model_hugo = Word2Vec(sentences, vector_size=200, window=5, epochs=10, sg=1, workers=4)

Error message:

RuntimeError: you must first build vocabulary before training the model

How do I build the vocabulary?

The code works with the sample .conll files, but I want to train the model on my own data.
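
For reference, this RuntimeError is what gensim raises when training starts with an empty vocabulary, which usually means the corpus handed to Word2Vec was empty. A quick sanity check on the parsed sentences (a sketch using the variables from the loop above) makes that visible:

# Word2Vec builds its vocabulary from `sentences`; if the list is empty,
# training has nothing to work with and raises the RuntimeError above
print(len(sentences))     # number of parsed sentences
print(sentences[:3])      # a few examples, if any were parsed at all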


Solution

  • Thanks to @gojomo's suggestion and to this answer, I resolved the empty-sentences issue. I needed the following block of code:

    # an iterable that reads the corpus file one line at a time instead of
    # loading everything into memory at once, yielding one tokenized
    # sentence (a list of words) per line

    class SentenceIterator:
        def __init__(self, filepath):
            self.filepath = filepath

        def __iter__(self):
            for line in open(self.filepath):
                yield line.split()
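
    The reason a class with __iter__ works here, where a plain generator would not, is that gensim iterates over the corpus more than once: one pass to build the vocabulary and another for every training epoch. A quick way to check that the iterator is restartable (the corpus.txt path below is just a placeholder):

    # the iterator can be consumed repeatedly, which Word2Vec relies on
    # (one pass for build_vocab(), then one pass per training epoch)
    it = SentenceIterator('corpus.txt')   # placeholder path
    print(sum(1 for _ in it))             # first full pass
    print(sum(1 for _ in it))             # restarting gives the same count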
    
    

    Then I trained the model on the iterator:

    # training the model

    sentences = SentenceIterator('/content/drive/MyDrive/rousseau/rousseau_corpus.txt')

    # min_count prunes the internal dictionary: words that appear only once in
    # the corpus are probably uninteresting typos and garbage, and there is not
    # enough data to do any meaningful training on those words anyway, so it's
    # best to ignore them
    model = gensim.models.Word2Vec(sentences, min_count=2)
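
    After training, it's easy to confirm that the vocabulary really was built and to probe the embeddings. A short sketch; the example word 'liberté' is just an assumption that fits a French Rousseau corpus, and the save filename is arbitrary:

    # confirm the vocabulary exists and query the trained vectors
    print(len(model.wv))                              # vocabulary size
    print(model.wv.most_similar('liberté', topn=5))   # 'liberté' is an assumed example word;
                                                      # it must appear at least min_count times

    # persist the model for later reuse
    model.save('rousseau_word2vec.model')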