gensim word2vec

word2vec models consist of characters instead of words


I am trying to build a word2vec model with Gensim for Persian text, which uses a space as the word delimiter. I am using Python 3.5. The problem is that when I give a text file as input, the resulting model contains only individual characters instead of words. I also tried giving the input as a list of words, as recommended in:

Python Gensim word2vec vocabulary key

That doesn't work for me either, and I think it ignores the order of the words within a sentence, so the result wouldn't be correct.

I did some preprocessing on my input, which consists of:

collapse multiple whitespace characters into a single one
tokenize by splitting on whitespace
remove words shorter than 3 characters
remove stop words

I gave the same text to the original word2vec implementation, which produced correct results, but I need this in Python, so my only option is Gensim.

I also tried loading the model produced by the original word2vec code into Gensim, but I get an error, so I need to create the word2vec model with Gensim itself.

My code is:

  from gensim.models import word2vec

  wfile = open('aggregate.txt','r')
  wfileRead = wfile.read()
  model = word2vec.Word2Vec(wfileRead , size=100)
  model.save('Word2Vec.txt')

Solution

  • The gensim Word2Vec model does not expect strings as its text examples (sentences), but lists-of-tokens. Thus, it's up to your code to tokenize your text, before passing it to Word2Vec.

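    Your code as shown just passes the raw contents of the 'aggregate.txt' file into Word2Vec as wfileRead. Because iterating over a Python string yields one character at a time, each character ends up being treated as its own one-token sentence, which is why the vocabulary contains only single characters.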

    Look at the examples in the gensim documentation, including the LineSentence class included with gensim, for ideas.
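
As a rough sketch of the tokenize-first approach, something like the following might work. It assumes 'aggregate.txt' holds one sentence per line with space-separated words; the helper names `read_sentences` and `train`, the `min_len` cutoff, and the `stop_words` argument are made up for illustration and mirror the preprocessing described in the question. gensim's own `LineSentence` class does the per-line whitespace splitting for you if you don't need the extra filtering.

```python
def read_sentences(path, min_len=3, stop_words=frozenset()):
    """Yield one list of tokens per non-empty line of the file.

    Hypothetical helper: str.split() with no argument splits on any run of
    whitespace, so repeated spaces are collapsed automatically.
    """
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            tokens = [
                tok for tok in line.split()
                if len(tok) >= min_len and tok not in stop_words
            ]
            if tokens:
                yield tokens


def train(path):
    """Train a Word2Vec model on lists of tokens rather than a raw string."""
    # Imported here so the tokenizer above can be tried without gensim.
    from gensim.models import Word2Vec

    # Word2Vec iterates over the corpus more than once, so a one-shot
    # generator is not enough; materialize it as a list (fine for small
    # corpora, an assumption here).
    sentences = list(read_sentences(path))

    # `size` matches the question (gensim < 4.0); gensim 4.0 and later
    # renamed this parameter to `vector_size`.
    return Word2Vec(sentences, size=100)
```

Calling `train('aggregate.txt')` and then `model.save('Word2Vec.txt')` on the result would replace the last two lines of the question's code. Since each sentence is kept as an ordered list of tokens, the word order within a sentence is preserved for the context window.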