I am using gensim to create a word2vec model from a sample file I have in a directory. I followed an online tutorial that reads the files in a directory and processes them line by line. My sample file has 9 lines in it, but this code gives me the same line 9 times. Can someone please explain what's happening?
import os

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                # This print sits inside the line loop, so it runs once per line
                print(os.path.join(self.dirname, fname))
                yield line.split()

sentences = MySentences('/fakepath/Folder')
Details: Suppose the file contains 3 lines like:
hi how are you.
I am fine.
I am good.
line.split() should give me ['hi', 'how', 'are', 'you.'] only once. But this happens 3 times, so I get the above list thrice instead of once. If there are 5 sentences in total, it returns the line 5 times.
First, you should figure out what you are trying to do. The class MySentences takes a directory as a parameter, and sentences is an instance of it. Its __iter__ method is a generator, so iterating over sentences yields every line of every file in the directory, each one split into a list of words.
For example:
for line in sentences:
    print(line)
You will get one list per line, with the words as its elements (I have removed the print statement that prints the path):
['hi', 'how', 'are', 'you.']
['I', 'am', 'fine.']
['I', 'am', 'good.']
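To check this yourself, here is a minimal self-contained sketch. It assumes the repeated output in the question comes from the path being printed inside the line loop, which is only one possible reading of the symptom. The temporary directory and the file name sample.txt are made up for the demo, and the print is moved to the file loop so the path shows up once per file rather than once per line.

```python
import os
import tempfile


class MySentences(object):
    """Yield every line of every file in a directory as a list of words."""

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in sorted(os.listdir(self.dirname)):
            path = os.path.join(self.dirname, fname)
            # Outside the line loop, so the path is printed once per file.
            print(path)
            with open(path) as handle:
                for line in handle:
                    yield line.split()


# Hypothetical sample data: a temporary directory holding one three-line file.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "sample.txt"), "w") as handle:
    handle.write("hi how are you.\nI am fine.\nI am good.\n")

sentences = MySentences(demo_dir)
result = list(sentences)
for words in result:
    print(words)
```

Each line of the file comes back exactly once. Because __iter__ starts over every time it is called, looping over sentences a second time yields the same lists again, which is worth keeping in mind if whatever consumes the object iterates over it more than once.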