I'm starting to get familiar with Word2Vec, but I'm struggling with a problem and couldn't find anything similar... I want to use gensim's Word2Vec on an imported PDF document (a book). To import it I used PyPDF2 and stored the whole book in a list. I then used gensim's simple_preprocess to preprocess the data. This worked so far, and I got the following output:
text=['schottky','diode','semiconductors',...]
Then I tried to use Word2Vec:
from gensim.models import Word2Vec
model=Word2Vec(text, size=100, window=5, min_count=5, workers=4)
words=list(model.wv.vocab)
but the output was like this:
print(words)
['c','h','t','k','d',...]
I expected the same words as in the text list, not just single characters. When I tried to find relations between words (e.g. 'schottky' and 'diode'), I got an error message saying that neither of these words is in the vocabulary.
My first thought was that the import was wrong, but I got the same result with textract instead of PyPDF2.
Does anyone know what the problem is? Thanks for your help!
Appendix:
Importing the book
import os
import PyPDF2
content_text=[]
number_of_inputs=len(os.listdir(path))
file_to_open=path
open_file=open(file_to_open,'rb')
read_pdf=PyPDF2.PdfFileReader(open_file)
number_of_pages=read_pdf.getNumPages()
page_content=""
# concatenate the text of every page into one string
for page_number in range(number_of_pages):
    page = read_pdf.getPage(page_number)
    page_content += page.extractText()
# store the whole book as a single list entry
content_text.append(page_content)
Word2Vec requires as its sentences parameter a training corpus that is an iterable sequence (such as a list) in which each item is itself a list of string tokens.
If you supply just a list-of-strings, each string is seen as a list-of-one-character-strings, resulting in all the one-letter words you're seeing.
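To make the effect concrete, here is a minimal pure-Python sketch. It is not gensim's actual vocabulary code; count_tokens is a hypothetical helper that only imitates the two nested loops any consumer of a sentences corpus performs (outer loop over sentences, inner loop over each sentence's tokens).

```python
from collections import Counter

def count_tokens(sentences):
    """Hypothetical stand-in for a vocabulary scan:
    iterate over sentences, then over the tokens of each sentence."""
    counts = Counter()
    for sentence in sentences:
        for token in sentence:  # iterating a str yields single characters
            counts[token] += 1
    return counts

flat = ['schottky', 'diode', 'semiconductors']  # list of strings
nested = [flat]                                 # list of lists of strings

flat_vocab = count_tokens(flat)
nested_vocab = count_tokens(nested)

print(sorted(flat_vocab))    # single characters only
print(sorted(nested_vocab))  # the whole words
```

With the flat list, each word is treated as a "sentence" and its letters as the "words", which matches the single-character vocabulary in the question.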
So, use a list-of-lists-of-words, more like:
[
['schottky','diode','semiconductors'],
]
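One way to arrive at such a structure is to keep the text of each page (or paragraph) separate and tokenize each piece on its own. The sketch below assumes a list of page strings; pages_to_sentences, its tokenize parameter, and the sample pages are hypothetical names for illustration, and str.split stands in for a real tokenizer so the snippet runs without gensim installed.

```python
def pages_to_sentences(pages, tokenize=str.split, min_tokens=1):
    """Turn a list of page strings into a list of token lists.

    `tokenize` is any callable mapping a string to a list of tokens;
    str.split is only a placeholder for a proper tokenizer.
    """
    sentences = []
    for page in pages:
        tokens = tokenize(page.lower())
        if len(tokens) >= min_tokens:  # skip empty pages
            sentences.append(tokens)
    return sentences

# Made-up sample pages, for illustration only
pages = [
    "Schottky diode basics",
    "",
    "Semiconductors and the Schottky barrier",
]
sentences = pages_to_sentences(pages)
print(sentences)
```

With gensim available, passing tokenize=simple_preprocess (from gensim.utils) should give the same preprocessing as in the question, and the resulting sentences list can then be handed to Word2Vec in place of text.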
(Note also that you generally won't get interesting Word2Vec results on tiny toy-sized data sets of just a few texts and just dozens to hundreds of words. You need many thousands of unique words, across many dozens of contrasting examples of each word, to induce the useful word-vector arrangements that Word2Vec is known for.)