I have the first Harry Potter book as a txt file. From it I created two new txt files: in the first, all occurrences of Hermione are replaced with Hermione_1; in the second, all occurrences of Hermione are replaced with Hermione_2. I then concatenated the two texts into one long text and used it as input for Word2Vec.
This is my code:
import os
from gensim.models import Word2Vec
from gensim.models import KeyedVectors

with open("HarryPotter1.txt", 'r') as original, \
     open("HarryPotter1_1.txt", 'w') as mod1, \
     open("HarryPotter1_2.txt", 'w') as mod2:
    data = original.read()
    data_1 = data.replace("Hermione", "Hermione_1")
    data_2 = data.replace("Hermione", "Hermione_2")
    mod1.write(data_1 + "\n")  # "\n", not r"\n": a raw string writes a literal backslash-n
    mod2.write(data_2 + "\n")

with open("longText.txt", 'w') as longFile:
    with open("HarryPotter1_1.txt", 'r') as textfile:
        for line in textfile:
            longFile.write(line)
    with open("HarryPotter1_2.txt", 'r') as textfile:
        for line in textfile:
            longFile.write(line)

model = None
word_vectors = None
modelName = "ModelTest"
vectorName = "WordVectorsTestst"
answer2 = input("Overwrite embedding? (yes or n) ")  # raw_input() in Python 2
if answer2 == 'yes':
    with open("longText.txt", 'r') as longFile:
        sentences = []
        for line in longFile:
            # one sentence per line; appending every line's words to a single
            # shared list would make each "sentence" grow to the whole corpus
            sentences.append(line.split())
    model = Word2Vec(sentences, workers=4, window=5, min_count=5)
    model.save(modelName)
    model.wv.save_word2vec_format(vectorName + ".bin", binary=True)
    model.wv.save_word2vec_format(vectorName + ".txt", binary=False)
    model.wv.save(vectorName)
    word_vectors = model.wv
else:
    model = Word2Vec.load(modelName)
    word_vectors = KeyedVectors.load_word2vec_format(vectorName + ".bin", binary=True)
print(model.wv.similarity("Hermione_1","Hermione_2"))
print(model.wv.distance("Hermione_1","Hermione_2"))
print(model.wv.most_similar("Hermione_1"))
print(model.wv.most_similar("Hermione_2"))
How is it possible that model.wv.most_similar("Hermione_1")
and model.wv.most_similar("Hermione_2")
give me different output?
Their neighbours are completely different. This is the output of the four print statements:
0.00799602753634
0.992003972464
[('moments,', 0.3204237222671509), ('rose;', 0.3189219534397125), ('Peering', 0.3185565173625946), ('Express,', 0.31800806522369385), ('no...', 0.31678506731987), ('pushing', 0.3131707012653351), ('triumph,', 0.3116190731525421), ('no', 0.29974159598350525), ('them?"', 0.2927379012107849), ('first.', 0.29270970821380615)]
[('go?', 0.45812922716140747), ('magical', 0.35565727949142456), ('Spells."', 0.3554503619670868), ('Scabbets', 0.34701400995254517), ('cupboard."', 0.33982667326927185), ('dreadlocks', 0.3325180113315582), ('sickening', 0.32789379358291626), ('First,', 0.3245708644390106), ('met', 0.3223033547401428), ('built', 0.3218075931072235)]
Training word2vec models is random to an extent, which is one reason you may get different results. Also, Hermione_2
only starts appearing in the second half of the concatenated text. By the time it is introduced, the context of Hermione_1
is already established, and so is its vector; you are then introducing a second word in exactly the same contexts, and the algorithm effectively tries to find what differentiates the two.
Secondly, the vector size you use may be too small to represent the complexity of the conceptual space. Because of that simplification, you can end up with two vectors that barely overlap.