I am having trouble training my own word2vec model on my .txt files.
The code:
import gensim
import json
import pandas as pd
import glob
import gensim.downloader as api
import matplotlib.pyplot as plt
from gensim.models import KeyedVectors

# loading the .txt files
# (the loop expects CoNLL-style input: tab-separated columns, blank line between sentences)
sentences = []
sentence = []
for doc in glob.glob('./data/*.txt'):
    with open(doc, 'r') as f:
        for line in f:
            line = line.rstrip()
            if line == "":
                # a blank line ends the current sentence
                if len(sentence) > 0:
                    sentences.append(sentence)
                    sentence = []
            else:
                cols = line.split("\t")
                if len(cols) > 4:
                    form = cols[1]
                    lemma = cols[2]
                    pos = cols[3]
                    if pos != "PONCT":  # skip punctuation tokens
                        sentence.append(form.lower())

# trying to train the model
from gensim.models import Word2Vec
model_hugo = Word2Vec(sentences, vector_size=200, window=5, epochs=10, sg=1, workers=4)
Error message:
RuntimeError: you must first build vocabulary before training the model
How do I build the vocabulary?
The code works with the sample .conll files, but I want to train the model on my own data.
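For the record, passing sentences to the Word2Vec constructor already builds the vocabulary implicitly (the explicit equivalent is model.build_vocab(sentences) followed by model.train(...)). The error really means the corpus came up empty: the loader above splits on tabs and expects CoNLL-style columns, so on plain .txt files nothing is ever appended. A quick check makes this visible:

print(len(sentences))  # 0 for the plain .txt files: nothing to build a vocabulary from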
Thanks to @gojomo's suggestion and to this answer, I resolved the empty-sentences issue. I needed the following block of code before training the model:
# an iterator class that reads the file one line at a time
# instead of loading everything into memory at once
class SentenceIterator:
    def __init__(self, filepath):
        self.filepath = filepath

    def __iter__(self):
        for line in open(self.filepath):
            yield line.split()  # one whitespace-tokenized sentence per line
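One detail worth flagging: this is a class with __iter__ rather than a plain generator because Word2Vec iterates over the corpus several times (one pass to build the vocabulary, then one per training epoch), so the corpus must be restartable. A quick demonstration, with a hypothetical small file:

it = SentenceIterator('corpus.txt')  # hypothetical file
print(sum(1 for _ in it))            # first full pass
print(sum(1 for _ in it))            # same count again: each iteration reopens the
                                     # file, whereas a generator would be exhausted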
# training the model
sentences = SentenceIterator('/content/drive/MyDrive/rousseau/rousseau_corpus.txt')
model = gensim.models.Word2Vec(sentences, min_count=2)
# min_count prunes the internal dictionary: words that appear only once in the
# corpus are probably uninteresting typos and garbage, and there is not enough
# data to train anything meaningful for them anyway, so it is best to ignore them
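As an aside, gensim ships gensim.models.word2vec.LineSentence, which does the same job as the hand-rolled iterator (one whitespace-tokenized sentence per line, read lazily and restartably), so it can be dropped in directly. The query word below is just an assumed example from the corpus:

from gensim.models.word2vec import LineSentence

# LineSentence replaces SentenceIterator one-for-one
sentences = LineSentence('/content/drive/MyDrive/rousseau/rousseau_corpus.txt')
model = gensim.models.Word2Vec(sentences, min_count=2)

# quick sanity check on the trained vectors ('nature' is an assumed corpus word)
print(model.wv.most_similar('nature', topn=5))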