Gensim Doc2Vec: I'm getting different vectors from documents that are identical

I have the following code, and I think I am retrieving the vectors incorrectly: for example, two documents that are 100% identical end up with different vectors.

import glob
import multiprocessing
import os

from gensim import models
from gensim.models.doc2vec import TaggedDocument

def getDocs(corpusPath):
    """Yield each CSV file in the corpus as a TaggedDocument."""
    # Loop over all the files in the corpus
    for file in glob.glob(os.path.join(corpusPath, '*.csv')):
        # getWords (defined elsewhere) returns the list of words in the given file
        # os.path.basename(file) extracts the filename from the complete path
        yield TaggedDocument(words=getWords(file), tags=[os.path.basename(file)])

def getModel(corpusPath, outputName):
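    # Get the documents' words from the path. Materialize them as a list,
    # because the corpus is iterated more than once (vocabulary scan plus
    # every training epoch) and a generator can only be consumed once.
    documents = list(getDocs(corpusPath))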

    cores = multiprocessing.cpu_count()

    # Initialize the model
    model = models.doc2vec.Doc2Vec(vector_size=100, epochs=10, min_count=1, max_vocab_size=None, alpha=0.025, min_alpha=0.01, workers=cores)

    # Build Vocabulary
    model.build_vocab(documents)

    # Train the model
    model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

    # Save the model as shown below
    model.save_word2vec_format(outputName, doctag_vec=True, word_vec=False, prefix="")

And the output has to be like this:

12571 100
134602.csv 0.00691074 0.157398 0.0921498 0.126362 0.158668 -0.0753151 -0.164655 0.0883756 0.0407546 0.15239 -0.0145177 0.061617 -0.0891562 -0.0417054 -0.0858589 0.00102948 0.0161595 2.13553e-05 -0.0668119 0.0450828 0.117537 -0.0729031 -0.0580456 -0.00258632 -0.104359 0.136366 -0.144994 -0.12065 -0.121757 0.0830929 -0.16462 -0.0151503 0.0399056 0.160027 -0.0787732 -0.00789994 -0.094897 0.00608254 -0.0661624 0.129721 0.163127 -0.0793746 -0.0964145 0.0606208 0.0875067 0.0161015 -0.132051 -0.0491245 -0.154828 0.133222 -0.0687664 0.120808 -0.111705 -0.053042 -0.0912231 -0.111089 0.0443708 -0.139493 0.0607425 -0.161168 0.0786498 0.150048 0.146688 -0.0837242 -0.0553738 -0.117545 0.0986267 -0.0923841 0.098877 -0.12193 -0.062616 -0.0845228 -0.0636123 0.0823107 -0.0826875 0.139011 -0.0923962 0.0288433 0.137355 0.121588 -0.145517 0.160373 0.0628389 -0.0764258 -0.107213 0.0421445 0.137447 -0.0658571 0.0424128 0.0672861 0.109817 -0.126953 -0.0453275 0.0834503 0.0974179 0.00825522 -0.165445 -0.0213084 -0.0292943 -0.162938
The first token on each line is the file name, and what follows is the corresponding vector for that file. I need to save the vectors in this format so that I can use them with external software.


  • The algorithm ('Paragraph Vector') behind Doc2Vec makes use of randomness during initialization and training. Training also never reaches a point where all adjustments stop – just a point where it's believed that further updates will have negligible net value.

    So, identical texts won't end up with identical vectors: each one is updated, alongside the model's internals, in every training cycle, against a slightly different base model and with slightly different randomization choices. If you have enough data, good parameters, and enough training passes, the vectors should become very close. Your downstream evaluations and uses should be tolerant of such small variances.

    Similarly, two runs on the same corpus won't result in identical end-vectors unless extreme care is taken to force determinism – for example by limiting training to a single worker thread, so that OS thread scheduling unpredictability doesn't slightly change the order of training examples. So vectors should only be compared if they were co-trained together, in the same model – and again, downstream applications should be tolerant of slight jitter from run to run or example to example.
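
    One way to make a downstream check tolerant of that jitter is to compare vectors by cosine similarity instead of exact equality. The following is only a minimal sketch in plain Python: the helper names `cosine_similarity` and `nearly_same`, the 0.95 threshold, and the two short sample vectors are all made up for illustration and are not part of gensim.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def nearly_same(a, b, threshold=0.95):
    """Treat two vectors as equivalent when their cosine similarity is high.

    The threshold is an arbitrary illustrative value; tune it on real data.
    """
    return cosine_similarity(a, b) >= threshold


# Hypothetical doc-vectors for two identical texts trained in the same model:
# close to each other, but not bit-for-bit equal.
vec_a = [0.10, -0.20, 0.30, 0.05]
vec_b = [0.11, -0.19, 0.29, 0.06]

print(vec_a == vec_b)             # exact equality is too strict a test
print(nearly_same(vec_a, vec_b))  # similarity-based comparison
```

    With real doc-vectors, the same comparison would be applied to the two vectors looked up by their tags in one co-trained model.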

    Other notes about your setup:

    • min_count=1 is almost always a bad choice - words with single (or few) examples just add noise to the training, making resulting vectors worse.

    • stochastic gradient descent optimization typically ends after the learning-rate alpha has been smoothly reduced to a tiny, near-zero value (such as 0.0001) – you're using a final alpha (0.01) that's a full 40% of the starting alpha.

    • you may also want to save your models using gensim's native .save(), because .save_word2vec_format() discards most model internals, and squashes the doc-vectors into the same namespace as any word-vectors.
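
    To make the learning-rate point concrete, here is a rough sketch of a linear decay from the starting alpha to the final alpha, viewed once per epoch. It is an approximation for illustration only: `linear_alpha_schedule` is a hypothetical helper, not a gensim function, and the real training loop adjusts alpha at a finer granularity than whole epochs.

```python
def linear_alpha_schedule(start_alpha, end_alpha, epochs):
    """Return the learning rate at each epoch boundary under linear decay.

    The list has epochs + 1 entries: the value before the first epoch,
    then the value after each completed epoch.
    """
    if epochs < 1:
        raise ValueError("epochs must be at least 1")
    step = (start_alpha - end_alpha) / epochs
    return [start_alpha - step * i for i in range(epochs + 1)]


# Settings from the question: the rate stops at 0.01, which is 40% of 0.025.
question_schedule = linear_alpha_schedule(0.025, 0.01, epochs=10)

# A more typical setup lets the rate decay to a near-zero value.
typical_schedule = linear_alpha_schedule(0.025, 0.0001, epochs=10)

for epoch, (q, t) in enumerate(zip(question_schedule, typical_schedule)):
    print(f"after epoch {epoch:2d}: question={q:.5f}  typical={t:.5f}")
```

    With the question's settings, the updates in the last epoch are still a large fraction of the size of the first ones, so the vectors are still moving noticeably when training stops.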