Tags: python, python-2.7, iterator, word2vec, listiterator

Python File iterator running multiple times


I am using gensim to create a word2vec model from a sample file in a directory. I followed an online tutorial that reads the files in a directory and processes them line by line. My sample file has 9 lines, but this code gives me the same lines 9 times. Can someone please explain what is happening?

 import os

 class MySentences(object):
     def __init__(self, dirname):
         self.dirname = dirname

     def __iter__(self):
         for fname in os.listdir(self.dirname):
             path = os.path.join(self.dirname, fname)
             # The with block closes each file once it has been read
             with open(path) as f:
                 for line in f:
                     # Debug output: prints the file path once per line read
                     print(path)
                     yield line.split()

 sentences = MySentences('/fakepath/Folder')

Details: Suppose the file contains 3 lines:

hi how are you.
I am fine.
I am good.

line.split() should give me ['hi', 'how', 'are', 'you.'] only once. Instead I get that list three times. If the file has 5 sentences, each line comes back 5 times.


Solution

  • First you should figure out what you are trying to do. The class MySentences takes a directory as a parameter, and sentences is an instance of it whose __iter__ method is a generator. Iterating over sentences therefore yields every line of every file in the directory, each split into a list of words.

    For example:

    for line in sentences:
        print(line)
    

    You will get one list of words per line (I have removed the print statement that prints the path):

    ['hi', 'how', 'are', 'you.']
    ['I', 'am', 'fine.']
    ['I', 'am', 'good.']
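
A likely reason the lines show up more than once, and this is an assumption because the question does not show the call that builds the model, is that the object gets iterated several times. gensim's Word2Vec reads the corpus once to build the vocabulary and then again for each training epoch, and every pass calls __iter__ again and re-reads the files. The sketch below tries to show that with a throwaway temporary directory. The passes counter and the sample.txt file name are hypothetical and exist only for this demonstration.

```python
import os
import tempfile


class MySentences(object):
    """Yields each line of every file in dirname as a list of words."""

    def __init__(self, dirname):
        self.dirname = dirname
        self.passes = 0  # hypothetical counter, added only for this demo

    def __iter__(self):
        # Runs once per full pass over the corpus
        self.passes += 1
        for fname in sorted(os.listdir(self.dirname)):
            with open(os.path.join(self.dirname, fname)) as f:
                for line in f:
                    yield line.split()


# Build a throwaway directory holding the 3-line sample file
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "sample.txt"), "w") as f:
    f.write("hi how are you.\nI am fine.\nI am good.\n")

sentences = MySentences(demo_dir)

first_pass = list(sentences)   # one full pass: each line appears once
second_pass = list(sentences)  # a new pass re-opens and re-reads the file

print(first_pass)
print(sentences.passes)
```

Within a single pass each line is yielded exactly once. Repeats only appear when something iterates over sentences again, so a print inside __iter__ will fire on every pass the consumer makes.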