How to run word2vec on Windows using gensim

A couple of years ago, a previous developer on my team wrote the following Python code, which calls word2vec with a training file and the location of an output file. He worked on Linux. I have been asked to get this running on a Windows machine. Bearing in mind that I know next to no Python, I have installed Gensim, which I'm guessing implements word2vec now, but I do not know how to rewrite the code to use the library rather than the executable, which doesn't seem possible to compile on a Windows box. Could someone help me update this code, please?

#!/usr/bin/env python3

import os
import csv
import subprocess
import shutil

from gensim.models import word2vec

def train_word2vec(trainFile, output):
    # run word2vec:
    subprocess.run(["word2vec", "-train", trainFile, "-output", output,
                    "-cbow", "0", "-window", "10", "-size", "100"])
    # Remove some invalid unicode:
    with open(output, 'rb') as input_,\
         open('%s.tmp' % output, 'w') as new_output:  # '.tmp' suffix is a placeholder
        for line in input_:
            try:
                print(line.decode('utf-8'), file=new_output, end='')
            except UnicodeDecodeError:
                pass  # skip lines that are not valid UTF-8
    shutil.move('%s.tmp' % output, output)

def main():
    train_word2vec("c:/temp/wc/test1_BigF.txt", "c:/temp/wc/test1_w2v_model.txt")

if __name__ == '__main__':
    main()

  • I think the core of what you're after looks something like this:

    import sys
    from gensim.models.word2vec import Word2Vec
    def train_word2vec(trainFile, output):
        # compile word arrays for each sentence of input vocab
        sentences = list(line.split() for line in open(trainFile))
        # effective executable invocation of original code (included for reference)
        # word2vec -train {trainFile} -output {output} -cbow 0 -window 10 -size 100
        # invocation via word2vec module with (mostly) equivalent params
        # note: gensim 4.0 and later spell the `size` argument `vector_size`
        model = Word2Vec(sentences, size=100, window=10, min_count=1, workers=4)
        # save generated model
        model.save(output)
    if __name__ == '__main__':
        train_word2vec(sys.argv[1], sys.argv[2])

    Save the script under any name and invoke it as follows:

    python <script name>.py input.txt output.txt
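
One Windows-specific wrinkle with reading the training file: open() without an encoding argument uses the locale's code page rather than UTF-8, and list(...) pulls the whole corpus into memory. A minimal sketch of a restartable, streaming alternative follows. The class name LineSentences is made up for illustration, and UTF-8 is an assumption about the training file.

```python
class LineSentences:
    """Iterate over a text file, yielding one list of tokens per line.

    The object can be iterated more than once, which matters because
    training code typically makes several passes over the corpus.
    """

    def __init__(self, path, encoding="utf-8"):
        self.path = path
        self.encoding = encoding  # assumption: the corpus is UTF-8

    def __iter__(self):
        # errors="ignore" silently drops bytes that are invalid in the encoding
        with open(self.path, encoding=self.encoding, errors="ignore") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens
```

An instance can be passed wherever the list of sentences is used, e.g. LineSentences(trainFile). gensim also ships a similar helper, gensim.models.word2vec.LineSentence.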

    A few things to note:

    • There's different capitalisation used for names of the module (word2vec) and the imported class (Word2Vec). It will break if you mix them up.
    • I've not found/included an equivalent for the command line -cbow 0 argument. I'd guess this indicates a preference for the Skip-gram algorithm over CBOW, but would need someone with more gensim experience than me to advise on its ramifications - or indeed those of leaving it out.
    • Nor have I included (or attempted to reproduce) the Unicode removal logic of the original. The generated model output is largely binary data, so, applied as is, that logic (a) falls over pretty much straight away and (b) leaves me rather in the dark as to what it's even trying to achieve.
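
On the -cbow 0 point: gensim's Word2Vec constructor takes an sg argument, where sg=1 selects skip-gram and sg=0 (the default) selects CBOW, so -cbow 0 should correspond to sg=1. Below is a minimal sketch of that translation, not a tested drop-in replacement. The helper name cli_flags_to_gensim_kwargs and the function name train_word2vec_skipgram are made up for illustration. Saving through model.wv.save_word2vec_format in text mode is my assumption about what is closest to the file the original executable wrote.

```python
def cli_flags_to_gensim_kwargs(cbow, window, size, gensim_major):
    """Translate the original word2vec command-line flags into keyword
    arguments for gensim's Word2Vec class (hypothetical helper)."""
    kwargs = {
        "sg": 0 if cbow else 1,  # -cbow 0 means skip-gram, i.e. sg=1
        "window": window,
    }
    # gensim 4.0 renamed the `size` parameter to `vector_size`
    kwargs["vector_size" if gensim_major >= 4 else "size"] = size
    return kwargs


def train_word2vec_skipgram(train_file, output):
    # imported here so the helper above works even without gensim installed
    import gensim
    from gensim.models.word2vec import LineSentence, Word2Vec

    major = int(gensim.__version__.split(".")[0])
    kwargs = cli_flags_to_gensim_kwargs(cbow=0, window=10, size=100,
                                        gensim_major=major)
    # LineSentence streams one whitespace-tokenised sentence per line
    model = Word2Vec(LineSentence(train_file), min_count=1, workers=4, **kwargs)
    # write the vectors as plain text rather than a pickled model
    model.wv.save_word2vec_format(output, binary=False)
```

min_count=1 and workers=4 are carried over from the code above rather than from the original command line.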

    Hope this helps a little anyway.
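
If the vectors end up in a text file and the cleanup step is still wanted, here is a minimal sketch of what I take the original "remove some invalid unicode" loop to be doing: dropping any line that does not decode as UTF-8. That reading is a guess on my part, and the function name drop_undecodable_lines is made up for illustration.

```python
import os
import tempfile


def drop_undecodable_lines(path, encoding="utf-8"):
    """Rewrite `path` in place, keeping only lines that decode cleanly.

    Returns a (kept, dropped) pair of line counts.
    """
    directory = os.path.dirname(os.path.abspath(path))
    kept = dropped = 0
    # write to a temporary file in the same directory, then swap it in
    with open(path, "rb") as source, tempfile.NamedTemporaryFile(
            "w", encoding=encoding, newline="", dir=directory,
            delete=False) as target:
        for raw in source:
            try:
                target.write(raw.decode(encoding))
                kept += 1
            except UnicodeDecodeError:
                dropped += 1  # line contains bytes that are not valid text
    # both files are closed at this point, so the swap also works on Windows
    os.replace(target.name, path)
    return kept, dropped
```

This only makes sense for text output. Running it over a binary model file would throw away most of the data.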