I'm trying to read a large .log file (more than sixty thousand lines) into memory so I can apply the Word2Vec algorithm implemented in gensim. I have tried a number of solutions, but none of them seems to work. Any help would be appreciated.
Code 1:

def file_reader(file_obj):
    return [word for line in open(file_obj, 'r') for word in line.split()]
Code 2:

for i, line in enumerate(open(file_obj, 'r')):
    print(i, line)
    sentences += line
You need to process the file in chunks. Since your file is line-based, you can use Python's normal line iteration (i.e. for line in file). The problem you're running into is that this
def file_reader(file_obj):
    return [word for line in open(file_obj, 'r') for word in line.split()]
loads the whole file into memory at once, building a single giant list in the return statement.
Instead of doing this, assemble your data at the same time you read each line. As you encounter each word, do your stop-word removal and lemmatization right there, and if there's anything left, add it to your collection.
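As a rough sketch of that idea (the function name, stop-word set, and cleanup steps here are made up for illustration), a generator can filter each line's words as they are read, so nothing beyond the current line is ever held in memory:

STOP_WORDS = {'the', 'a', 'an', 'is', 'to'}  # illustrative only

def stream_words(path, stop_words=STOP_WORDS):
    # Yields one cleaned word at a time instead of building a
    # list for the whole file.
    with open(path, 'r') as f:
        for line in f:
            for word in line.split():
                word = word.lower()
                if word not in stop_words:
                    yield word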
Or process it sentence by sentence if you need more context around each word. Either way, do the processing in your reader as the data comes in, rather than gathering everything from the file first and processing it afterwards.
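For gensim specifically, Word2Vec iterates over the corpus more than once (a vocabulary-building pass plus the training epochs), so a restartable iterable works better than a one-shot generator. A minimal sketch, assuming a gensim 4.x install and a made-up class name and file path:

from gensim.models import Word2Vec

class LogSentences:
    """Re-opens the log file on every pass and yields one
    tokenized line at a time, so the file is never fully loaded."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, 'r') as f:
            for line in f:
                tokens = line.split()
                if tokens:
                    yield tokens

# vector_size is the gensim 4.x parameter name; older versions call it 'size'.
model = Word2Vec(LogSentences('my.log'), vector_size=100, min_count=2, workers=4)

Any stop-word removal or lemmatization would go inside __iter__, right where the tokens are produced.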