I am trying to get word2vec to work in Python 3, but my dataset is too large to fit easily in memory, so I am loading it via an iterator (from zip files). However, when I run it I get this error:
Traceback (most recent call last):
  File "WordModel.py", line 85, in <module>
    main()
  File "WordModel.py", line 15, in main
    word2vec = gensim.models.Word2Vec(data,workers=cpu_count())
  File "/home/thijser/.local/lib/python3.7/site-packages/gensim/models/word2vec.py", line 783, in __init__
    fast_version=FAST_VERSION)
  File "/home/thijser/.local/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 759, in __init__
    self.build_vocab(sentences=sentences, corpus_file=corpus_file, trim_rule=trim_rule)
  File "/home/thijser/.local/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 936, in build_vocab
    sentences=sentences, corpus_file=corpus_file, progress_per=progress_per, trim_rule=trim_rule)
  File "/home/thijser/.local/lib/python3.7/site-packages/gensim/models/word2vec.py", line 1591, in scan_vocab
    total_words, corpus_count = self._scan_vocab(sentences, progress_per, trim_rule)
  File "/home/thijser/.local/lib/python3.7/site-packages/gensim/models/word2vec.py", line 1576, in _scan_vocab
    total_words += len(sentence)
TypeError: object of type 'generator' has no len()
Here is the code:
import zipfile
import os
from ast import literal_eval
from lxml import etree
import io
import gensim
from multiprocessing import cpu_count

def main():
    data = TrainingData("/media/thijser/Data/DataSets/uit2")
    print(len(data))
    word2vec = gensim.models.Word2Vec(data, workers=cpu_count())
    word2vec.save('word2vec.save')

class TrainingData:
    size = -1

    def __init__(self, dirname):
        self.data_location = dirname

    def __len__(self):
        if self.size < 0:
            for zipfile in self.get_zips_in_folder(self.data_location):
                for text_file in self.get_files_names_from_zip(zipfile):
                    self.size = self.size + 1
        return self.size

    def __iter__(self):  # might not fit in memory otherwise
        yield self.get_data()

    def get_data(self):
        for zipfile in self.get_zips_in_folder(self.data_location):
            for text_file in self.get_files_names_from_zip(zipfile):
                yield self.preproccess_text(text_file)

    def stripXMLtags(self, text):
        tree = etree.parse(text)
        notags = etree.tostring(tree, encoding='utf8', method='text')
        return notags.decode("utf-8")

    def remove_newline(self, text):
        text.replace("\\n", " ")
        return text

    def preproccess_text(self, text):
        text = self.stripXMLtags(text)
        text = self.remove_newline(text)
        return text

    def get_files_names_from_zip(self, zip_location):
        files = []
        archive = zipfile.ZipFile(zip_location, 'r')
        for info in archive.infolist():
            files.append(archive.open(info.filename))
        return files

    def get_zips_in_folder(self, location):
        zip_files = []
        for root, dirs, files in os.walk(location):
            for name in files:
                if name.endswith((".zip")):
                    filepath = root + "/" + name
                    zip_files.append(filepath)
        return zip_files

main()
for d in data:
    for dd in d:
        print(type(dd))
This does show me that dd is of type string and contains the correct preprocessed strings (each somewhere between 50 and 5000 words long).
Update after discussion:
Your TrainingData class's __iter__() function isn't providing a generator which returns each text in turn, but rather a generator which returns a single other generator. (There's one too many levels of yield.) That's not what Word2Vec is expecting.

Changing the body of your __iter__() method to simply...

    return self.get_data()

...so that __iter__() is a synonym for your get_data(), and just returns the same text-by-text generator that get_data() does, should help.
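For example, a minimal sketch of just that change, applied to the TrainingData class from the question:

    def __iter__(self):
        # return (not yield) the text-by-text generator, so iterating the
        # corpus object yields one text per step rather than a nested generator
        return self.get_data()

Since Word2Vec iterates over the corpus multiple times (once to scan the vocabulary, then once per training epoch), __iter__() gets called again for each pass, and each call handing back a fresh generator from get_data() is exactly what you want.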
Original answer:
You're not showing the TrainingData.preproccess_text() (sic) method, referenced inside get_data(), which is what is actually creating the data Word2Vec is processing. And it's that data that's generating the error.

Word2Vec requires its sentences corpus be an iterable sequence (for which a generator would be appropriate) where each individual item is a list-of-string-tokens.

From that error, it looks like the individual items in your TrainingData sequence may themselves be generators, rather than lists with a readable len().
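For example, one way to meet that requirement, sketched against the question's class, would be to have preproccess_text() return a token list instead of a raw string; the bare .split() here is just an illustrative whitespace tokenizer, not something gensim mandates:

    def preproccess_text(self, text):
        text = self.stripXMLtags(text)
        text = self.remove_newline(text)
        return text.split()  # a list of string tokens, as Word2Vec expects per corpus item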
(Separately, if perchance you're choosing to use generators there because the individual texts may be very long, be aware that gensim Word2Vec and related classes only train on individual texts with a length up to 10000 word-tokens. Any words past the 10000th will be silently ignored. If that's a concern, your source texts should be pre-broken into individual texts of 10000 tokens or fewer.)
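If that applies here, a rough sketch of pre-breaking a tokenized text into pieces no longer than that limit (the helper name and the hard-coded 10000 are just illustrative):

def split_long_text(tokens, max_len=10000):
    # Yield successive slices of at most max_len tokens so that no single
    # training text exceeds the per-text limit mentioned above.
    for start in range(0, len(tokens), max_len):
        yield tokens[start:start + max_len]

A corpus generator could then yield each chunk as its own text, instead of one oversized token list.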