I am trying to build a word2vec model with Gensim for Persian text, which uses a space as the word delimiter. I use Python 3.5. The problem is that when I give a text file as input, it returns a model whose vocabulary consists of individual characters instead of words. I also tried giving the input as a list of words, as recommended in:
Python Gensim word2vec vocabulary key
That doesn't work for me either, and I think it ignores the order of words within a sentence, so it wouldn't be correct.
I did some preprocessing on my input, which consists of:
collapse multiple whitespaces into a single one
tokenize by splitting on whitespace
remove words shorter than 3 characters
remove stop words
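The preprocessing steps above could be sketched roughly as follows. This is only an illustration: the function name `preprocess`, the `min_len` parameter, and the two-word stop list are assumptions made for the example, not the actual code or stop-word list used.

```python
import re

# Hypothetical, tiny stop-word set for illustration only;
# a real Persian stop-word list would be much longer.
STOP_WORDS = {"است", "برای"}


def preprocess(line, stop_words=STOP_WORDS, min_len=3):
    """Turn one line of raw text into a list of cleaned tokens."""
    # 1. Collapse runs of whitespace into a single space.
    collapsed = re.sub(r"\s+", " ", line).strip()
    # 2. Tokenize by splitting on whitespace.
    tokens = collapsed.split(" ") if collapsed else []
    # 3. Drop tokens shorter than min_len characters.
    # 4. Drop stop words.
    return [t for t in tokens if len(t) >= min_len and t not in stop_words]
```

Applied line by line, this yields one list of tokens per sentence, with the word order inside each sentence preserved.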
I gave the same text to the original word2vec, which produced correct results, but I need this in Python, so my choice is limited to Gensim.
I also tried to load the model produced by the original word2vec source code into Gensim, but I get an error, so I need to create the word2vec model with Gensim itself.
My code is:
wfile = open('aggregate.txt','r')
wfileRead = wfile.read()
model = word2vec.Word2Vec(wfileRead , size=100)
model.save('Word2Vec.txt')
The gensim Word2Vec model does not expect strings as its text examples (sentences), but lists-of-tokens. Thus, it's up to your code to tokenize your text, before passing it to Word2Vec.
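Your code as shown just passes the raw contents of the 'aggregate.txt' file into Word2Vec as wfileRead. Because iterating over a Python string yields its individual characters, each character ends up being treated as a token, which is why the resulting vocabulary consists of single characters.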
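To illustrate why a raw string leads to a character-level vocabulary, here is a small sketch in plain Python. It does not call gensim at all; the helper `as_word2vec_sees` is a hypothetical name that only mimics the way a corpus is consumed (iterate over the corpus to get sentences, then iterate over each sentence to get tokens).

```python
def as_word2vec_sees(corpus):
    """Mimic the corpus iteration: each item is a 'sentence',
    and each element of that sentence is a 'token'."""
    return [list(sentence) for sentence in corpus]


raw_text = "ab cd"

# Passing the raw string: every character becomes its own one-token sentence.
char_level = as_word2vec_sees(raw_text)

# Passing a list of token lists: tokens are whole words.
word_level = as_word2vec_sees([raw_text.split()])
```

With the raw string, the "sentences" are the characters `a`, `b`, the space, `c`, and `d`; with the tokenized input, there is one sentence containing the words `ab` and `cd`.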
Look at the examples in the gensim documentation, including the LineSentence class included with gensim, for ideas.
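As a rough sketch of what that could look like, assuming the file holds one sentence per line with words separated by whitespace: the class `TokenizedLines` and the function `train_word2vec` below are hypothetical names made up for this example. The class is a restartable iterable (not a one-shot generator), since training needs to go over the corpus more than once.

```python
class TokenizedLines:
    """Restartable iterable that yields one list of tokens per non-empty line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly.
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                tokens = line.split()  # split on any run of whitespace
                if tokens:
                    yield tokens


def train_word2vec(path):
    # Imported here so that TokenizedLines can be used without gensim installed.
    from gensim.models import Word2Vec

    # The vector dimensionality is left at its default of 100; the keyword is
    # `size` before gensim 4.0 and `vector_size` from 4.0 onwards.
    return Word2Vec(TokenizedLines(path))
```

Usage might then be `model = train_word2vec('aggregate.txt')` followed by `model.save('Word2Vec.txt')`. For whitespace-separated files like this, gensim's own LineSentence class does the same job as the hypothetical TokenizedLines, so `Word2Vec(LineSentence('aggregate.txt'))` should be an equivalent shortcut.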