I have the following code, and I think I am getting the vectors the wrong way, because, for example, the vectors of two documents that are 100% identical are not the same.
import glob
import multiprocessing
import os

from gensim import models
from gensim.models.doc2vec import TaggedDocument

def getDocs(corpusPath):
    """Function for processing documents as TaggedDocument"""
    # Loop over all the files in the corpus
    for file in glob.glob(os.path.join(corpusPath, '*.csv')):
        # getWords is a (user-defined) function that gets the words from the provided file
        # os.path.basename(file) takes the filename from the complete path
        yield TaggedDocument(words=getWords(file), tags=[os.path.basename(file)])

def getModel(corpusPath, outputName):
    # Get the documents from the path; materialize the generator into a list,
    # since build_vocab() would otherwise exhaust it before train() runs
    documents = list(getDocs(corpusPath))
    cores = multiprocessing.cpu_count()
    # Initialize the model
    model = models.doc2vec.Doc2Vec(vector_size=100, epochs=10, min_count=1,
                                   max_vocab_size=None, alpha=0.025,
                                   min_alpha=0.01, workers=cores)
    # Build the vocabulary
    model.build_vocab(documents)
    # Train the model
    model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)
    # Save only the doc-vectors in word2vec text format
    model.save_word2vec_format(outputName, doctag_vec=True, word_vec=False, prefix="")
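For reference, this is invoked along the lines of (hypothetical paths):

getModel('path/to/corpus', 'docvectors.txt')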
And the output has to be like this:
12571 100
134602.csv 0.00691074 0.157398 0.0921498 0.126362 0.158668 -0.0753151 -0.164655 0.0883756 0.0407546 0.15239 -0.0145177 0.061617 -0.0891562 -0.0417054 -0.0858589 0.00102948 0.0161595 2.13553e-05 -0.0668119 0.0450828 0.117537 -0.0729031 -0.0580456 -0.00258632 -0.104359 0.136366 -0.144994 -0.12065 -0.121757 0.0830929 -0.16462 -0.0151503 0.0399056 0.160027 -0.0787732 -0.00789994 -0.094897 0.00608254 -0.0661624 0.129721 0.163127 -0.0793746 -0.0964145 0.0606208 0.0875067 0.0161015 -0.132051 -0.0491245 -0.154828 0.133222 -0.0687664 0.120808 -0.111705 -0.053042 -0.0912231 -0.111089 0.0443708 -0.139493 0.0607425 -0.161168 0.0786498 0.150048 0.146688 -0.0837242 -0.0553738 -0.117545 0.0986267 -0.0923841 0.098877 -0.12193 -0.062616 -0.0845228 -0.0636123 0.0823107 -0.0826875 0.139011 -0.0923962 0.0288433 0.137355 0.121588 -0.145517 0.160373 0.0628389 -0.0764258 -0.107213 0.0421445 0.137447 -0.0658571 0.0424128 0.0672861 0.109817 -0.126953 -0.0453275 0.0834503 0.0974179 0.00825522 -0.165445 -0.0213084 -0.0292943 -0.162938
125202.csv 0.106642 0.167441 -0.0275412 0.130408 -0.107533 0.091452 0.0103496 -0.0214623 0.0873943 -0.0465384 -0.165227 -0.0540914 -0.00923723 0.175378 -0.051865 0.0107003 -0.179349 0.0683971 -0.159605 0.0644916 0.136338 0.111336 -0.0805002 0.00214934 -0.0490576 0.151279 -0.0397022 0.075442 -0.0278023 -0.0636982 0.174473 0.087985 -0.0714066 -0.0800442 -0.103995 -0.0228613 0.157171 -0.0678672 -0.161953 0.0839289 -0.155191 -0.00721683 0.0586751 -0.0474399 -0.122106 0.170611 0.157929 0.075531 -0.13505 0.093849 -0.119415 0.0386302 0.0139714 0.0756701 -0.0810199 -0.111754 0.112905 0.130293 -0.126257 -0.00654255 -0.0369909 -0.072449 0.0257127 0.0716955 0.103714 -0.0842208 -0.0534867 -0.095218 0.127797 -0.029322 0.161806 -0.177695 -0.0684089 0.0623551 0.06396 0.0828089 -0.0590939 0.0180832 -0.0591218 0.136139 -0.153984 0.108085 -0.127018 -0.0847872 -0.167081 0.0199622 0.0209045 0.0320618 0.0591803 0.0809688 0.0799196 0.15632 -0.0519707 0.0270171 -0.163197 -0.0846849 -0.176135 -0.0120047 -0.0697305 0.014441
116200.csv -0.0182099 -0.130409 -0.138414 -0.0310527 -0.0274882 -0.0711805 -0.0628653 -0.144249 -0.166021 -0.0242265 -0.130593 -0.141916 0.0119525 0.0500143 -0.147568 -0.036778 0.110357 0.0439302 -0.132496 -0.105203 0.0356234 0.0982645 0.134903 -0.0648039 -0.0566216 0.138991 -0.0467151 -0.140643 0.139711 0.0943256 0.0576583 0.0644239 0.00136725 -0.0296913 0.0612566 0.148131 0.067239 0.100442 0.0665155 0.104861 -0.0498524 0.0995954 -0.115922 -0.00524584 0.0491675 0.159028 0.132554 0.0479373 0.141164 0.173129 0.022317 -0.000446397 0.0867293 -0.155649 -0.0675728 -0.0981307 -0.0806008 -0.0107237 -0.103454 -0.0753868 -0.0551634 0.170743 0.0495554 0.11536 -0.0294355 0.061617 0.126016 -0.04804 -0.0315217 -0.169522 -0.0892494 -0.025444 0.0672556 0.166157 0.0647261 0.0944827 -0.0792354 0.0182105 0.118192 0.000124603 -0.10565 -0.155033 0.107355 0.150469 -0.104327 -0.162604 -0.0218357 0.145972 -0.145784 -0.00176559 0.153054 -0.16377 -0.11736 0.0892985 -0.0212026 0.0511168 -0.146278 -0.0134697 -0.0540684 0.0791529
148597.csv -0.15473 0.0955252 0.0432369 -0.0945614 0.136283 -0.102851 0.0847211 -0.0396431 -0.0467567 0.17154 0.153097 0.0693114 0.163837 0.135897 0.146128 -0.167215 -0.152268 -0.11602 0.0282252 -0.0779752 -0.0829204 0.018318 0.00621094 0.0707405 0.0968831 0.00652018 -0.0568833 0.0916579 -0.0400151 -0.0391421 -0.0548217 -0.173926 -0.110223 -0.0317329 -0.02952 -0.129147 0.0698902 -0.154276 -0.157658 -0.14261 0.032107 -0.0385964 -0.0587693 0.0212146 0.143626 0.142041 -0.0530896 -0.133748 0.131452 0.13672 0.148338 0.160325 -0.113424 0.0678939 -0.0229337 -0.170486 -0.156904 0.0710402 0.00277802 0.120395 0.0360002 -0.0593753 0.155915 -0.0620641 -0.112055 0.0153659 0.147731 -0.0249911 0.0360584 -0.0402479 0.022273 0.00174414 -0.0178126 -0.116679 0.0191754 -0.0089874 0.083151 -0.168562 -0.160357 -0.0659622 0.0248376 0.045583 0.127733 -0.0675122 -0.0734585 0.113653 0.166756 0.0723445 0.0554671 -0.0751338 0.0481711 -0.00127609 0.0560728 0.124651 -0.0495638 0.0985305 -0.110315 0.0672438 0.096637 0.104245
166916.csv 0.168698 0.0629846 0.0248923 -0.105248 0.172408 -0.0322083 0.174124 -0.113572 -0.0104922 0.0429484 -0.0306917 0.022368 -0.0584265 0.0337984 -0.0225754 0.143456 -0.121288 -0.133673 0.0677091 0.0583681 0.0390327 -0.141176 0.0694527 -0.0290526 -0.129707 -0.0765447 0.071578 0.146411 -0.112526 0.103688 -0.110703 0.0781341 0.0318269 0.105218 0.0177797 0.123248 0.158062 0.0370042 -0.137394 0.0246147 0.00653834 0.166063 -0.100149 -0.0479191 -0.0702838 0.0690037 0.114349 -0.0274343 0.014801 -0.0421596 0.0694873 0.0662955 -0.12477 -0.0088994 0.104959 0.149459 0.16611 0.0265376 -0.134808 0.101123 0.0431258 0.0584757 -0.0315779 0.121671 -0.0380923 -0.0897689 -0.0237933 0.110452 -0.0039647 0.106183 -0.165717 -0.16557 0.136988 0.121843 0.0722612 -0.00844494 0.175932 -0.0751714 0.152611 -0.0646956 0.105122 -0.108245 0.0583691 0.113012 0.171521 -0.0258976 0.0851889 -0.0941529 0.153386 0.0455267 -0.0259182 -0.0437207 -0.150415 0.132313 -0.143572 -0.0281547 -0.00231613 -0.00760185 -0.147233 -0.167408
148291.csv 0.00976907 0.168438 -0.0919878 -0.164332 -0.138181 -0.149775 -0.0394723 0.027946 0.0662307 -0.00850593 0.12174 0.106023 -0.11512 0.0694538 0.128228 0.066019 0.0805346 0.00220964 -0.0465066 0.0923588 0.121286 0.168551 0.0462572 0.0221805 -0.119831 0.00797117 -0.00709804 -0.0222688 0.0938169 0.100695 0.133902 0.15964 0.0544278 -0.0504766 -0.0539783 -0.0158389 0.0280565 -0.10531 0.112356 -0.0349924 0.155673 0.0491142 0.171533 -0.044268 0.0560867 -0.135758 0.114202 -0.120608 0.0373457 -0.0847815 0.0285375 -0.0101114 0.0169282 -0.00141743 -0.028344 -0.00979434 -0.0599551 0.0554465 -0.0583942 -0.169627 0.167471 -0.00661054 0.114252 -0.00489984 0.167312 0.144928 0.0376684 -0.118885 0.0426739 0.169052 0.00265325 0.146609 0.163534 -0.100965 -0.101386 0.127619 0.148285 -0.0881821 -0.100448 -0.044064 0.106071 0.0239426 0.0733384 -0.0962991 0.0939341 0.0659483 0.122844 -0.140426 -0.0485195 0.0645185 0.037179 0.0963829 -0.109955 -0.151168 -0.0413991 -0.0556731 -0.173456 -0.167728 -0.128145 0.150923
...
Where the first line is a header (number of vectors and their dimensionality), and on each following line the first token is the file name, followed by the corresponding vector for that file. I need to save the vectors in this format to use them with an external piece of software.
The algorithm ('Paragraph Vector') behind Doc2Vec makes use of randomness during initialization and training. Training also never reaches a point where all adjustments stop – just a point where it's believed that further updates will have negligible net value.
So, identical texts won't achieve identical vectors – they're each being updated, alongside the model's internals, with each training cycle, against a slightly-different base model, with slightly different randomization choices. If you have enough data, good parameters, and are engaging in enough training, they should become very close. And, your downstream evaluations/uses should be tolerant of such small variances.
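To see this in isolation, here's a toy sketch (hypothetical texts and tags, assuming gensim 4.x, where the doc-vectors live on model.dv): two 100%-identical documents still end up with different raw vectors. With a corpus this tiny the vectors themselves are meaningless; the point is only the inequality.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

words = "the quick brown fox jumps over the lazy dog".split()
# Two documents with 100%-identical text, under distinct tags
docs = [TaggedDocument(words, ["doc_a"]), TaggedDocument(words, ["doc_b"])]

model = Doc2Vec(docs, vector_size=20, min_count=1, epochs=50, workers=1)

# Identical texts, but the learned vectors are not element-for-element equal
print((model.dv["doc_a"] == model.dv["doc_b"]).all())   # False
# On a real corpus with enough training, they should at least be very similar
print(model.dv.similarity("doc_a", "doc_b"))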
Similarly, two runs on the same corpus won't result in identical end-vectors unless extreme care is taken to force determinism – for example by limiting training to a single worker thread, so that OS thread scheduling unpredictability doesn't slightly change the order of training examples. So vectors should only be compared if they were co-trained together, in the same model – and again, downstream applications should be tolerant of slight jitter from run to run or example to example.
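If you need run-to-run reproducibility for testing, the usual recipe is a fixed seed, a single worker thread, and a fixed hash seed – a sketch, assuming recent gensim; it makes training slower and shouldn't be mistaken for better vectors:

# Launch the interpreter with a fixed hash seed, e.g.:
#   PYTHONHASHSEED=0 python train.py
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(
    vector_size=100,
    epochs=10,
    seed=42,     # fixes the RNG used for vector initialization
    workers=1,   # one worker thread: no OS-scheduling jitter in example order
)
model.build_vocab(documents)   # `documents` as in the question's getModel()
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)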
Other notes about your setup:

- min_count=1 is almost always a bad choice – words with only a single (or a few) usage examples just add noise to the training, making the resulting vectors worse.
- Stochastic-gradient-descent optimization typically ends after the learning rate alpha has been smoothly reduced to a tiny, near-zero value (such as 0.0001) – you're using a final alpha (0.01) that's a full 40% of the starting alpha.
- You may also want to save your models using gensim's native .save(), because .save_word2vec_format() discards most model internals, and squashes the doc-vectors into the same namespace as any word-vectors; see the sketch after this list.
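Putting those notes together, a minimal sketch of the end of getModel() (assuming gensim 4.x, where the doc-vectors are a KeyedVectors object on model.dv – in 3.x it's model.docvecs – and with illustrative parameter values, not tuned ones):

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(
    vector_size=100,
    epochs=20,
    min_count=2,         # drop words seen only once; they mostly add noise
    alpha=0.025,
    min_alpha=0.0001,    # let the learning rate decay to near zero
    workers=cores,
)
model.build_vocab(documents)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

# Native save keeps all internals; the model can be re-loaded and re-used
model.save("corpus.doc2vec")

# Separately export just the doc-vectors, in word2vec text format,
# for the external software; this writes the same header-plus-rows
# layout shown in the question
model.dv.save_word2vec_format(outputName, binary=False)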