Tags: python, python-3.x, text-mining, gensim, word2vec

Gensim Word2Vec Vocabulary: Unclear output


I'm starting to get familiar with Word2Vec, but I'm struggling with a problem and couldn't find anything similar... I want to use gensim's Word2Vec on an imported PDF document (a book). To import it I used PyPDF2 and stored the whole book in a list. Furthermore, I used gensim's simple_preprocess to preprocess the data. This worked so far; I got the following output:

text=['schottky','diode','semiconductors',...]
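
(The exact preprocessing call isn't shown here; a minimal sketch of what would produce such a flat list, assuming simple_preprocess was applied once to the whole extracted text:)

from gensim.utils import simple_preprocess

# Assumption: one call over the entire book text returns a
# single flat list of word tokens.
text = simple_preprocess(page_content)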

So then I tried to use Word2Vec:

from gensim.models import Word2Vec
model=Word2Vec(text, size=100, window=5, min_count=5, workers=4)
words=list(model.wv.vocab)

but the output was like this:

print(words)
['c','h','t','k','d',...]

I expected the same words as in the text list, not just single characters. When I tried to find relations between words (e.g. 'schottky' and 'diode'), I got an error message saying that none of these words is in the vocabulary.

My first thought was that the import was wrong, but I got the same result with textract instead of PyPDF2.

Does anyone know what the problem is? Thanks for your help!

Appendix:

Importing the book

import os
import PyPDF2

content_text = []
number_of_inputs = len(os.listdir(path))

file_to_open = path
open_file = open(file_to_open, 'rb')
read_pdf = PyPDF2.PdfFileReader(open_file)
number_of_pages = read_pdf.getNumPages()

# Concatenate the text of every page into one string, then store
# the whole book as a single entry of content_text.
page_content = ""
for page_number in range(number_of_pages):
    page = read_pdf.getPage(page_number)
    page_content += page.extractText()
content_text.append(page_content)

Solution

  • Word2Vec requires as its sentences parameter a training corpus that is:

    • an iterable sequence (such as a list)
    • where each item is itself a list of string-tokens

    If you supply just a list-of-strings, each string is seen as a list-of-one-character-strings, resulting in all the one-letter words you're seeing.

    So, use a list-of-lists-of-words, more like:

    [
     ['schottky','diode','semiconductors'],
    ]
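
    A minimal sketch of the full fix, assuming the gensim 3.x API used in the question (Word2Vec(size=...), model.wv.vocab) and that content_text is changed to hold one string per page (the append moved inside the loop) rather than the whole book as one string:

     import PyPDF2
     from gensim.models import Word2Vec
     from gensim.utils import simple_preprocess

     # Collect one text string per page; file_to_open as in the question.
     open_file = open(file_to_open, 'rb')
     read_pdf = PyPDF2.PdfFileReader(open_file)
     content_text = []
     for page_number in range(read_pdf.getNumPages()):
         content_text.append(read_pdf.getPage(page_number).extractText())

     # Tokenize each page separately: the result is a list of lists of
     # string tokens, which is what the sentences parameter expects.
     sentences = [simple_preprocess(page_text) for page_text in content_text]

     # Same hyperparameters as in the question.
     model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
     print(list(model.wv.vocab)[:10])  # whole words such as 'schottky'

    Per-page texts also avoid feeding Word2Vec one giant token list; gensim's optimized training code truncates any single text longer than 10,000 tokens.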
    

    (Note also that you generally won't get interesting Word2Vec results on tiny toy-sized data sets of just a few texts and just dozens to hundreds of words. You need many thousands of unique words, across many dozens of contrasting examples of each word, to induce the useful word-vector arrangements that Word2Vec is known for.)
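
    Once trained on such a corpus, the similarity queries from the question work; a short usage sketch (checking membership first avoids a KeyError for words that fell under the min_count=5 cutoff):

     if 'schottky' in model.wv.vocab and 'diode' in model.wv.vocab:
         print(model.wv.similarity('schottky', 'diode'))
         print(model.wv.most_similar('schottky', topn=5))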