I am trying to get word2vec to work in Python 3, but my dataset is too large to fit easily in memory, so I am loading it via an iterator (from zip files). However, when I run it I get this error:
Traceback (most recent call last):
  File "WordModel.py", line 85, in <module>
    main()
  File "WordModel.py", line 15, in main
    word2vec = gensim.models.Word2Vec(data,workers=cpu_count())
  File "/home/thijser/.local/lib/python3.7/site-packages/gensim/models/word2vec.py", line 783, in __init__
    fast_version=FAST_VERSION)
  File "/home/thijser/.local/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 759, in __init__
    self.build_vocab(sentences=sentences, corpus_file=corpus_file, trim_rule=trim_rule)
  File "/home/thijser/.local/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 936, in build_vocab
    sentences=sentences, corpus_file=corpus_file, progress_per=progress_per, trim_rule=trim_rule)
  File "/home/thijser/.local/lib/python3.7/site-packages/gensim/models/word2vec.py", line 1591, in scan_vocab
    total_words, corpus_count = self._scan_vocab(sentences, progress_per, trim_rule)
  File "/home/thijser/.local/lib/python3.7/site-packages/gensim/models/word2vec.py", line 1576, in _scan_vocab
    total_words += len(sentence)
TypeError: object of type 'generator' has no len()
Here is the code:
import zipfile
import os
from ast import literal_eval
from lxml import etree
import io
import gensim
from multiprocessing import cpu_count

def main():
    data = TrainingData("/media/thijser/Data/DataSets/uit2")
    print(len(data))
    word2vec = gensim.models.Word2Vec(data, workers=cpu_count())
    word2vec.save('word2vec.save')

class TrainingData:
    size = -1

    def __init__(self, dirname):
        self.data_location = dirname

    def __len__(self):
        if self.size < 0:
            for zipfile in self.get_zips_in_folder(self.data_location):
                for text_file in self.get_files_names_from_zip(zipfile):
                    self.size = self.size + 1
        return self.size

    def __iter__(self):  # might not fit in memory otherwise
        yield self.get_data()

    def get_data(self):
        for zipfile in self.get_zips_in_folder(self.data_location):
            for text_file in self.get_files_names_from_zip(zipfile):
                yield self.preproccess_text(text_file)

    def stripXMLtags(self, text):
        tree = etree.parse(text)
        notags = etree.tostring(tree, encoding='utf8', method='text')
        return notags.decode("utf-8")

    def remove_newline(self, text):
        text.replace("\\n", " ")
        return text

    def preproccess_text(self, text):
        text = self.stripXMLtags(text)
        text = self.remove_newline(text)
        return text

    def get_files_names_from_zip(self, zip_location):
        files = []
        archive = zipfile.ZipFile(zip_location, 'r')
        for info in archive.infolist():
            files.append(archive.open(info.filename))
        return files

    def get_zips_in_folder(self, location):
        zip_files = []
        for root, dirs, files in os.walk(location):
            for name in files:
                if name.endswith((".zip")):
                    filepath = root + "/" + name
                    zip_files.append(filepath)
        return zip_files

main()
for d in data:
    for dd in d:
        print(type(dd))
This does show me that dd is of type string and contains the correct preprocessed strings (each somewhere between 50 and 5000 words long).
Update after discussion:
Your TrainingData class's __iter__() function isn't providing a generator which returns each text in turn, but rather a generator which returns a single other generator. (There's one too many levels of yield.) That's not what Word2Vec is expecting.

Changing the body of your __iter__() method to simply...

    return self.get_data()

...so that __iter__() is a synonym for your get_data(), and just returns the same text-by-text generator that get_data() does, should help.
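For example, a minimal sketch of just that change, applied to the TrainingData class from the question:

    def __iter__(self):
        # return (not yield) the text-by-text generator, so iterating the
        # corpus object yields one text per step rather than a nested generator
        return self.get_data()

Since Word2Vec iterates over the corpus multiple times (once to scan the vocabulary, then once per training epoch), __iter__() gets called again for each pass, and each call handing back a fresh generator from get_data() is exactly what you want.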
Original answer:
You're not showing the TrainingData.preproccess_text() (sic) method, referenced inside get_data(), which is what is actually creating the data Word2Vec is processing. And it's that data that's generating the error.

Word2Vec requires its sentences corpus be an iterable sequence (for which a generator would be appropriate) where each individual item is a list-of-string-tokens.

From that error, it looks like the individual items in your TrainingData sequence may themselves be generators, rather than lists with a readable len().
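For example, one way to meet that requirement, sketched against the question's class, would be to have preproccess_text() return a token list instead of a raw string; the bare .split() here is just an illustrative whitespace tokenizer, not something gensim mandates:

    def preproccess_text(self, text):
        text = self.stripXMLtags(text)
        text = self.remove_newline(text)
        return text.split()  # a list of string tokens, as Word2Vec expects per corpus item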
(Separately, if perchance you're choosing to use generators there because the individual texts may be very long, be aware that gensim Word2Vec and related classes only train on individual texts with a length up to 10000 word-tokens. Any words past the 10000th will be silently ignored. If that's a concern, your source texts should be pre-broken into individual texts of 10000 tokens or fewer.)
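If that applies here, a rough sketch of pre-breaking a tokenized text into pieces no longer than that limit (the helper name and the hard-coded 10000 are just illustrative):

def split_long_text(tokens, max_len=10000):
    # Yield successive slices of at most max_len tokens so that no single
    # training text exceeds the per-text limit mentioned above.
    for start in range(0, len(tokens), max_len):
        yield tokens[start:start + max_len]

A corpus generator could then yield each chunk as its own text, instead of one oversized token list.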