nlp gensim word2vec feature-extraction text-classification

I get single characters as the learned vocabulary from gensim word2vec


I am new to word2vec. I trained a model on a text file with word2vec for feature extraction, but when I look at the learned vocabulary, it contains single characters instead of words. What did I miss here? Any help is appreciated.

I tried feeding tokens instead of the raw text into the model:

import nltk

from pathlib import Path
data_folder = Path("")
file_to_open = data_folder / "test.txt"
#read the file
file = open(file_to_open , "rt")
raw_text = file.read()
file.close()

#tokenization
token_list = nltk.word_tokenize(raw_text)

#Remove Punctuation
from nltk.tokenize import punkt
token_list2 = list(filter(lambda token : punkt.PunktToken(token).is_non_punct,token_list))
#upper to lower case
token_list3 = [word.lower() for word in token_list2]
#remove stopwords
from nltk.corpus import stopwords
token_list4 = list(filter(lambda token: token not in stopwords.words("english"),token_list3))

#lemmatization
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
token_list5 = [lemmatizer.lemmatize(word) for word in token_list4]
print("Final Tokens are :")
print(token_list5,"\n")
print("Total tokens : ", len(token_list5))

#word Embedding
from gensim.models import Word2Vec
# train model
model = Word2Vec(token_list5, min_count=2)
# summarize the loaded model
print("The model is :")
print(model,"\n")

# summarize vocabulary
words = list(model.wv.vocab)
print("The learned vocabulary words are : \n",words)

Output: ['p', 'o', 't', 'e', 'n', 'i', 'a', 'l', 'r', 'b', 'u', 'm', 'h', 'd', 'c', 's', 'g', 'q', 'f', 'w', '-']
Expected: ['potenial', 'xyz', 'etc']

Solution

  • Word2Vec needs its training corpus to be a sequence where each item (text/sentence) is a list-of-string-tokens.

    If you instead pass texts that are raw strings, each will appear as a list-of-one-character-tokens, and that will result in the final vocabulary you're seeing, where all learned 'words' are just single-characters.

    So, take a closer look at your token_list5 variable. As it is a list, what is token_list5[0]? (Is it a list-of-strings?) What is token_list5[0][0]? (Is it a full word?)
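
To make the point above concrete, here is a minimal sketch of why a flat list of tokens ends up as a vocabulary of characters. It does not use gensim: `build_vocab` is a hypothetical helper that only imitates the two-level iteration (corpus, then sentence, then token), and the sample tokens are made up for illustration.

```python
from collections import Counter


def build_vocab(corpus, min_count=1):
    """Hypothetical stand-in for vocabulary building (not gensim's code).

    The outer loop yields "sentences" and the inner loop yields "tokens",
    which is the shape a sentence-based trainer expects.
    """
    counts = Counter()
    for sentence in corpus:
        for token in sentence:
            counts[token] += 1
    return sorted(tok for tok, n in counts.items() if n >= min_count)


# A flat list of tokens, shaped like token_list5 in the question
# (made-up sample data).
tokens = ["potential", "energy", "potential", "energy"]

# Passing the flat list: each "sentence" is a string, so iterating over it
# yields single characters.
flat_vocab = build_vocab(tokens, min_count=2)

# Wrapping it: the corpus is now a list containing one sentence, and that
# sentence is a list of word tokens.
nested_vocab = build_vocab([tokens], min_count=2)

print(flat_vocab)    # single characters only
print(nested_vocab)  # whole words

# With gensim, the corresponding change in the question's code would be:
#   model = Word2Vec([token_list5], min_count=2)
```

Wrapping the list treats the whole file as one sentence. For a longer text it is probably better to split it into sentences first (for example with `nltk.sent_tokenize`) and tokenize each sentence separately, so the corpus becomes a list of token lists.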