I have a relatively large text file (around 7 million lines) and I want to run some specific logic over it, which I'll try to explain below:
A1KEY1
A2KEY1
B1KEY2
C1KEY3
D1KEY3
E1KEY4
I want to count how often each key appears, and then output those with a frequency of 1 into one text file, those with a frequency of 2 into another, and those with a frequency higher than 2 into a third.
This is the code I have so far, but it iterates over the dictionary painfully slowly, and it gets slower the further it progresses.
    def filetoliststrip(file):
        file_in = str(file)
        lines = list(open(file_in, 'r'))
        content = [x.strip() for x in lines]
        return content

    dict_in = dict()
    seen = []
    fileinlist = filetoliststrip(file_in)
    out_file = open(file_ot, 'w')
    out_file2 = open(file_ot2, 'w')
    out_file3 = open(file_ot3, 'w')
    counter = 0
    for line in fileinlist:
        counter += 1
        keyf = line[10:69]
        print("Loading line " + str(counter) + " : " + str(line))
        if keyf not in dict_in.keys():
            dict_in[keyf] = []
            dict_in[keyf].append(1)
            dict_in[keyf].append(line)
        else:
            dict_in[keyf][0] += 1
            dict_in[keyf].append(line)
    for j in dict_in.keys():
        print("Processing key: " + str(j))
        #print(dict_in[j])
        if dict_in[j][0] < 2:
            out_file.write(str(dict_in[j][1]))
        elif dict_in[j][0] == 2:
            for line_in in dict_in[j][1:]:
                out_file2.write(str(line_in) + "\n")
        elif dict_in[j][0] > 2:
            for line_in in dict_in[j][1:]:
                out_file3.write(str(line_in) + "\n")
    out_file.close()
    out_file2.close()
    out_file3.close()
I'm running this on a Windows PC (i7, 8 GB RAM), so this should not be taking hours. Is the problem the way I read the file into a list? Should I use a different method? Thanks in advance.
Several things slow your code down: there is no need to load the whole file into memory only to iterate over it again; there is no need to get a list of keys each time you want to do a lookup (if key not in dict_in: ... will suffice and will be blazingly fast); and you don't need to keep a per-key line count, since you can check the length of each key's list of lines afterwards. To name but a few.
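To make the lookup point concrete, here is a minimal sketch. It assumes Python 2 semantics, where dict_in.keys() returns a fresh list on every call; that behaviour is emulated with list(d) so the sketch also runs on Python 3. The function names and sizes are made up for illustration.

```python
import timeit


def count_hits_dict(d, probes):
    # Hash lookup: the cost does not grow with the number of keys.
    return sum(1 for k in probes if k in d)


def count_hits_key_list(d, probes):
    # Emulates Python 2's `k in d.keys()`: build a list of all keys,
    # then scan it linearly, once for every probe.
    return sum(1 for k in probes if k in list(d))


if __name__ == "__main__":
    d = {"KEY%d" % i: [] for i in range(2000)}
    probes = ["KEY%d" % i for i in range(0, 4000, 2)]  # half hit, half miss
    for fn in (count_hits_dict, count_hits_key_list):
        seconds = timeit.timeit(lambda: fn(d, probes), number=1)
        print(fn.__name__, fn(d, probes), "hits in", round(seconds, 4), "s")
```

On Python 3, dict_in.keys() is a view and membership on it is a hash lookup as well, so there the per-line print calls in the original loop are probably the bigger cost.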
I'd completely restructure your code as:
    import collections

    dict_in = collections.defaultdict(list)  # save some time with a dictionary factory
    with open(file_in, "r") as f:  # open the file_in for reading
        for line in f:  # read the file line by line
            key = line.strip()[10:69]  # assuming this is how you get your key
            dict_in[key].append(line)  # add the line as an element of the found key
    # now that we have the lines in their own key brackets, let's write them based on frequency
    with open(file_ot, "w") as f1, open(file_ot2, "w") as f2, open(file_ot3, "w") as f3:
        selector = {1: f1, 2: f2}  # make our life easier with a quick length-based lookup
        for values in dict_in.values():  # use dict_in.itervalues() on Python 2.x
            selector.get(len(values), f3).writelines(values)  # write the collected lines
And you'll hardly get more efficient than that, at least in Python.
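As a quick sanity check, the same approach can be run against the six sample lines from the question. This is only a sketch: the function name split_by_frequency, the key_slice parameter, and the temporary file names are hypothetical, and the sample keys start at character 2 rather than at 10:69 as in the real data.

```python
import collections
import os
import tempfile


def split_by_frequency(file_in, file_ot, file_ot2, file_ot3, key_slice):
    # Group lines by key, reading the input one line at a time.
    groups = collections.defaultdict(list)
    with open(file_in, "r") as f:
        for line in f:
            groups[line.strip()[key_slice]].append(line)
    # Route each group to an output file chosen by its size.
    with open(file_ot, "w") as f1, open(file_ot2, "w") as f2, open(file_ot3, "w") as f3:
        selector = {1: f1, 2: f2}
        for lines in groups.values():
            selector.get(len(lines), f3).writelines(lines)


def demo():
    sample = "A1KEY1\nA2KEY1\nB1KEY2\nC1KEY3\nD1KEY3\nE1KEY4\n"
    names = ("in.txt", "once.txt", "twice.txt", "more.txt")
    with tempfile.TemporaryDirectory() as workdir:
        paths = [os.path.join(workdir, name) for name in names]
        with open(paths[0], "w") as f:
            f.write(sample)
        # Assumption: in the sample the key is everything from character 2 on.
        split_by_frequency(*paths, key_slice=slice(2, None))
        results = []
        for path in paths[1:]:
            with open(path) as f:
                results.append(f.read())
    return results


if __name__ == "__main__":
    for name, text in zip(("once", "twice", "more"), demo()):
        print(name, text.split())
```

With the sample data, the KEY2 and KEY4 lines land in the first file, the KEY1 and KEY3 lines in the second, and the third stays empty.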
Keep in mind that this does not guarantee the order of lines in the output prior to Python 3.7 (or CPython 3.6). The order within a key itself will be preserved, however. If you need to keep the line order on earlier Python versions, you'll have to keep a separate list of keys in order and iterate over it to pick up the dict_in values in order.
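The separate key order list could look roughly like this. It is a sketch only: the function name group_in_first_seen_order and the key_slice parameter are hypothetical, and it takes any iterable of lines (an open file works) instead of a file name.

```python
def group_in_first_seen_order(lines, key_slice):
    groups = {}
    key_order = []  # keys in the order they first appear in the input
    for line in lines:
        key = line.strip()[key_slice]
        if key not in groups:
            groups[key] = []
            key_order.append(key)
        groups[key].append(line)
    # Walk the order list, not the dict, so the result does not depend
    # on the dict's iteration order.
    return [groups[key] for key in key_order]


if __name__ == "__main__":
    sample = ["A1KEY1\n", "B1KEY2\n", "A2KEY1\n"]
    for group in group_in_first_seen_order(sample, slice(2, None)):
        print(len(group), [line.strip() for line in group])
```

Each returned group can then be written with the same length-based selector as above.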