I'm working with Python 3.5 and writing a script that handles large spreadsheet files. Each row of the spreadsheet contains a phrase and several other relevant values. I'm parsing the file as a matrix, but the example file has over 3000 rows (and even larger files are to be expected). I also have a list of 100 words. For each word, I need to find which rows of the matrix contain it in their phrase, and print some averages based on that.
Currently I'm iterating over each row of the matrix and checking whether the string contains any of the listed words, but this takes 3000 iterations with 100 checks each. Is there a better way to accomplish this task?
In the long run, I would encourage you to use something more suitable for the task. A SQL database, for instance.
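As a rough illustration of the database route, here is a minimal sketch using the standard-library sqlite3 module with an in-memory database. The table names (phrases, tokens), the column names, and the sample rows are all made up for the example, and it assumes each row carries a single numeric value to average:

```python
import sqlite3

# Hypothetical sample data: (phrase, number) pairs standing in for spreadsheet rows.
rows = [
    ("the quick brown fox", 10.0),
    ("a lazy dog sleeps", 4.0),
    ("the dog barks", 6.0),
]
words = ["dog", "fox", "cat"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE phrases (id INTEGER PRIMARY KEY, number REAL)")
conn.execute("CREATE TABLE tokens (phrase_id INTEGER, word TEXT)")
conn.execute("CREATE INDEX idx_tokens_word ON tokens (word)")

for phrase, number in rows:
    cur = conn.execute("INSERT INTO phrases (number) VALUES (?)", (number,))
    # set() so that a word repeated within one phrase is stored only once
    conn.executemany(
        "INSERT INTO tokens (phrase_id, word) VALUES (?, ?)",
        [(cur.lastrowid, w) for w in set(phrase.split())],
    )

# One query returns the average and the number of matching rows for every word.
placeholders = ",".join("?" * len(words))
query = (
    "SELECT t.word, AVG(p.number), COUNT(*) "
    "FROM tokens t JOIN phrases p ON p.id = t.phrase_id "
    "WHERE t.word IN ({}) GROUP BY t.word ORDER BY t.word"
).format(placeholders)
averages = {word: (avg, n) for word, avg, n in conn.execute(query, words)}
print(averages)
```

Words with no matching row (here "cat") simply do not appear in the result. The index on tokens.word is what lets the lookup avoid scanning every phrase.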
But if you stick with writing your own Python solution, here are some things you can do to optimize it:
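Use sets. Sets have a very efficient membership check, so a single intersection per row replaces the 100 separate checks.

    wordset_100 = set(worldlist_100)
    for row in data_3k:
        # set.intersection accepts any iterable, so the split list can be passed directly
        word_matches = wordset_100.intersection(row.phrase.split(" "))
        for match in word_matches:
            # add to accumulator
            # this loop runs at most len(row.phrase.split(" ")) times
            pass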
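To make the set approach concrete, here is a minimal self-contained sketch that also computes the averages the question asks for. The Row namedtuple, its number field, and the sample data are assumptions for illustration only. Note that splitting on whitespace matches whole words, not substrings:

```python
from collections import defaultdict, namedtuple

# Hypothetical row type: a phrase plus one numeric value to average.
Row = namedtuple("Row", ["phrase", "number"])

data = [
    Row("the quick brown fox", 10.0),
    Row("a lazy dog sleeps", 4.0),
    Row("the dog barks", 6.0),
]
wordset = {"dog", "fox", "cat"}

totals = defaultdict(float)  # sum of row.number per matched word
counts = defaultdict(int)    # number of rows containing each word

for row in data:
    # One intersection per row instead of one check per word in the list.
    for word in wordset.intersection(row.phrase.split()):
        totals[word] += row.number
        counts[word] += 1

averages = {word: totals[word] / counts[word] for word in totals}
print(averages)
```

Words that never occur (here "cat") are absent from averages, so no division by zero can happen.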
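Use multiprocessing to spread the rows across several processes.

    from collections import defaultdict
    from functools import partial
    from multiprocessing import Pool

    def matches(wordset_100, row):
        # return the matched words together with the row they came from
        return wordset_100.intersection(row.phrase.split(" ")), row

    if __name__ == "__main__":
        accu = defaultdict(int)
        p = Pool()
        wordset_100 = set(worldlist_100)
        # partial binds the word set, so map only has to supply each row
        for m, r in p.map(partial(matches, wordset_100), data_3k):
            for word in m:
                accu[word] += r.number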