Tags: python, gensim, word2vec, word-embedding

word2vec/gensim — RuntimeError: you must first build vocabulary before training the model


I am having trouble training a word2vec model on my own .txt files.

The code:

import gensim
import json
import pandas as pd
import glob
import gensim.downloader as api
import matplotlib.pyplot as plt
from gensim.models import KeyedVectors


# loading the .txt files

sentences = []
sentence = []
for doc in glob.glob('./data/*.txt'):
    with open(doc, 'r') as f:
        for line in f:
            line = line.rstrip()
            if line == "":
                if len(sentence) > 0:
                    sentences.append(sentence)
                    sentence = []
            else:
                cols = line.split("\t")
                if len(cols) > 4:
                    form = cols[1]
                    lemma = cols[2]
                    pos = cols[3]
                    if pos != "PONCT":
                        sentence.append(form.lower())


# trying to train the model

from gensim.models import Word2Vec
model_hugo = Word2Vec(sentences, vector_size=200, window=5, epochs=10, sg=1, workers=4)

Error message:

RuntimeError: you must first build vocabulary before training the model

How do I build the vocabulary?

The code works with the sample .conll files, but I want to train the model on my own data.
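
For reference, this RuntimeError is what gensim raises when training starts with an empty vocabulary, which usually means the corpus handed to Word2Vec was empty. A quick sanity check on the parsed sentences (a sketch using the variables from the loop above) makes that visible:

# Word2Vec builds its vocabulary from `sentences`; if the list is empty,
# training has nothing to work with and raises the RuntimeError above
print(len(sentences))     # number of parsed sentences
print(sentences[:3])      # a few examples, if any were parsed at all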


Solution

  • Thanks to @gojomo's suggestion and to this answer, I resolved the empty-sentences issue. I needed the following block of code:

    # an iterable that reads the corpus file one line at a time instead of
    # loading everything into memory at once, yielding one tokenized
    # sentence (a list of words) per line

    class SentenceIterator:
        def __init__(self, filepath):
            self.filepath = filepath

        def __iter__(self):
            for line in open(self.filepath):
                yield line.split()
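
    The reason a class with __iter__ works here, where a plain generator would not, is that gensim iterates over the corpus more than once: one pass to build the vocabulary and another for every training epoch. A quick way to check that the iterator is restartable (the corpus.txt path below is just a placeholder):

    # the iterator can be consumed repeatedly, which Word2Vec relies on
    # (one pass for build_vocab(), then one pass per training epoch)
    it = SentenceIterator('corpus.txt')   # placeholder path
    print(sum(1 for _ in it))             # first full pass
    print(sum(1 for _ in it))             # restarting gives the same count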
    
    

    Then I trained the model on the iterator:

    # training the model

    sentences = SentenceIterator('/content/drive/MyDrive/rousseau/rousseau_corpus.txt')

    # min_count prunes the internal dictionary: words that appear only once in
    # the corpus are probably uninteresting typos and garbage, and there is not
    # enough data to do any meaningful training on those words anyway, so it's
    # best to ignore them
    model = gensim.models.Word2Vec(sentences, min_count=2)
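
    After training, it's easy to confirm that the vocabulary really was built and to probe the embeddings. A short sketch; the example word 'liberté' is just an assumption that fits a French Rousseau corpus, and the save filename is arbitrary:

    # confirm the vocabulary exists and query the trained vectors
    print(len(model.wv))                              # vocabulary size
    print(model.wv.most_similar('liberté', topn=5))   # 'liberté' is an assumed example word;
                                                      # it must appear at least min_count times

    # persist the model for later reuse
    model.save('rousseau_word2vec.model')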