A couple of years ago, a previous developer on my team wrote the following Python code, which calls the word2vec executable, passing in a training file and the location of an output file. He worked on Linux. I have been asked to get this running on a Windows machine. I know next to no Python. I have installed Gensim, which I'm guessing implements word2vec, but I don't know how to rewrite the code to use the library rather than the executable, which doesn't seem possible to compile on a Windows box. Could someone help me update this code, please?
#!/usr/bin/env python3
import os
import csv
import subprocess
import shutil
from gensim.models import word2vec

def train_word2vec(trainFile, output):
    # run word2vec:
    subprocess.run(["word2vec", "-train", trainFile, "-output", output,
                    "-cbow", "0", "-window", "10", "-size", "100"],
                   shell=False)
    # Remove some invalid unicode:
    with open(output, 'rb') as input_,\
            open('%s.new' % output, 'w') as new_output:
        for line in input_:
            try:
                print(line.decode('utf-8'), file=new_output, end='')
            except UnicodeDecodeError:
                print(line)
                pass
    shutil.move('%s.new' % output, output)

def main():
    train_word2vec("c:/temp/wc/test1_BigF.txt", "c:/temp/wc/test1_w2v_model.txt")

if __name__ == '__main__':
    main()
I think the core of what you're after looks something like this:
import sys
from gensim.models.word2vec import Word2Vec

def train_word2vec(trainFile, output):
    # compile word arrays for each sentence of input vocab
    with open(trainFile) as f:
        sentences = [line.split() for line in f]
    # effective executable invocation of original code (included for reference)
    # word2vec -train {trainFile} -output {output} -cbow 0 -window 10 -size 100
    # invocation via word2vec module with (mostly) equivalent params
    # (gensim 4.0+ names the dimensionality parameter vector_size; older releases call it size)
    model = Word2Vec(sentences, vector_size=100, window=10, min_count=1, workers=4)
    # save generated model (gensim's own format, not the text file the executable wrote)
    model.save(output)

if __name__ == '__main__':
    train_word2vec(sys.argv[1], sys.argv[2])
Save this as train.py and invoke it as follows:
python train.py input.txt output.txt
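One Windows-specific wrinkle with reading the training file: a bare open() call uses the platform's default encoding, which on Windows is usually not UTF-8, and the original script went out of its way to cope with invalid unicode. Here is a minimal sketch of a sentence reader that decodes as UTF-8 and skips undecodable bytes. The class name Utf8Sentences is hypothetical, and note that it drops the bad bytes rather than the whole line, which is not exactly what the original clean-up step did. It is written as a class with __iter__ rather than a generator because Word2Vec iterates over the corpus more than once.

```python
class Utf8Sentences:
    """Restartable iterable that yields one list of tokens per input line.

    Hypothetical helper, not part of gensim.
    """

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # errors="ignore" silently drops bytes that are not valid UTF-8
        with open(self.path, encoding="utf-8", errors="ignore") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens
```

An instance can be passed wherever the script passes its sentences list, e.g. Word2Vec(Utf8Sentences(trainFile), ...), and it avoids holding the whole file in memory.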
A few things to note:
- Be careful to distinguish between the module (word2vec) and the imported class (Word2Vec). It will break if you mix them up.
- I haven't carried over the -cbow 0 argument. I'd guess this indicates a preference for the Skip-gram algorithm over CBOW, but would need someone with more gensim experience than me to advise on its ramifications - or indeed those of leaving it out.

Hope this helps a little anyway.
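On the -cbow 0 point: in the word2vec executable, -cbow 0 selects skip-gram, and gensim's Word2Vec exposes the same choice through its sg parameter (sg=1 for skip-gram, sg=0, the default, for CBOW). Below is a rough sketch of how the original flags might map across, assuming gensim 4.x (where the dimensionality parameter is vector_size). The helper names cli_to_gensim_kwargs and train_like_cli are made up for illustration. The sketch also writes the vectors with save_word2vec_format, which produces the plain-text vectors format the executable emits, rather than gensim's own model.save format.

```python
def cli_to_gensim_kwargs(cli_args):
    """Translate a few word2vec CLI flags into Word2Vec keyword arguments.

    Hypothetical helper; only -cbow, -window and -size are handled.
    """
    # Pair each flag with the value that follows it
    flags = dict(zip(cli_args[::2], cli_args[1::2]))
    kwargs = {}
    if "-cbow" in flags:
        # -cbow 1 means CBOW (sg=0); -cbow 0 means skip-gram (sg=1)
        kwargs["sg"] = 0 if flags["-cbow"] == "1" else 1
    if "-window" in flags:
        kwargs["window"] = int(flags["-window"])
    if "-size" in flags:
        # assumption: gensim 4.x; older releases expect "size" instead
        kwargs["vector_size"] = int(flags["-size"])
    return kwargs


def train_like_cli(sentences, output, cli_args):
    """Train with the translated parameters and write text-format vectors."""
    # Imported here so the flag mapping above can be used without gensim installed
    from gensim.models.word2vec import Word2Vec

    model = Word2Vec(sentences, min_count=1, workers=4,
                     **cli_to_gensim_kwargs(cli_args))
    # Plain-text word2vec format, one word and its vector per line
    model.wv.save_word2vec_format(output, binary=False)
    return model
```

With the flags from the original script, cli_to_gensim_kwargs(["-cbow", "0", "-window", "10", "-size", "100"]) yields sg=1, window=10 and vector_size=100. Whether skip-gram or CBOW gives better vectors for your data is something you would need to check empirically.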