I have two data files that I am processing in Python, each containing two-column data as below:
3023084 5764
9152549 5812
18461998 5808
45553152 5808
74141469 5753
106932238 5830
112230478 5795
135207137 5800
148813978 5802
154818883 5798
There are about 10M entries in each file (~400 MB).
I have to sort through each file and check whether any number in the first column of one file matches any number in the first column of the other file.
The code I currently have converts the files to lists:
ch1 = []
with open('ch1.txt', 'r') as file:
    for line in file:
        if ':' not in line:  # skip non-data lines (those containing ':')
            line = line.split()
            ch1.append([line[0], line[1]])

ch2 = []
with open('ch2.txt', 'r') as file:
    for line in file:
        if ':' not in line:
            line = line.split()
            ch2.append([line[0], line[1]])
I then iterate through both lists looking for a match. When a match is found, I wish to append the sum of the right-hand columns to a new list, coin:
coin = []
for item1 in ch1:
    for item2 in ch2:
        if item1[0] == item2[0]:
            coin.append(int(item1[1]) + int(item2[1]))
The issue is that this takes a very long time and/or crashes. Is there a more efficient way of doing this?
There are lots of ways to improve this; for example:
Since you only scan through the contents of ch1.txt once (it is the outer loop), you don't need to read it into a list; processing each line as you read it will use less memory, though it probably won't speed things up much. See the sketch below.
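For instance, a minimal sketch of that streaming version, keeping ch2 in memory as before (file names as in the question):

coin = []
with open('ch1.txt', 'r') as file1:
    for line in file1:
        if ':' not in line:  # same non-data filter as before
            key, value = line.split()
            for item2 in ch2:  # ch2 is still the in-memory list
                if key == item2[0]:
                    coin.append(int(value) + int(item2[1]))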
If you sort each of your lists first (ch1.sort() and ch2.sort()), you can check for matches much more efficiently with a single pass over both. Something like:
coin = []
i1, i2 = 0, 0
while i1 < len(ch1) and i2 < len(ch2):
    if ch1[i1][0] == ch2[i2][0]:
        # Match: record the sum of the second columns
        coin.append(int(ch1[i1][1]) + int(ch2[i2][1]))
        # Advance both indices (assumes keys are unique within each file)
        i1 += 1
        i2 += 1
    elif ch1[i1][0] < ch2[i2][0]:
        # Advance the index of the smaller value
        i1 += 1
    else:  # ch1[i1][0] > ch2[i2][0]
        i2 += 1

(Since the first-column values are strings, the < comparison is lexicographic; that's fine as long as the lists are sorted with the same ordering, which a plain ch1.sort() gives you.)
If the data in the files are already sorted, you can combine both ideas: instead of advancing an index, you simply read the next line from the corresponding file. This improves efficiency in both time and space. A sketch of that combined approach is below.
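For example, here is a minimal sketch of the file-based merge. It assumes both files are already sorted numerically by their first column (as the sample data suggests); next_entry is a hypothetical helper that skips the non-data lines and converts the key to int so that the comparisons match the numeric order:

def next_entry(f):
    # Return the next (key, value) data pair from f, or None at end of file
    for line in f:
        if ':' not in line:
            key, value = line.split()
            return int(key), value
    return None

coin = []
with open('ch1.txt', 'r') as f1, open('ch2.txt', 'r') as f2:
    e1, e2 = next_entry(f1), next_entry(f2)
    while e1 is not None and e2 is not None:
        if e1[0] == e2[0]:
            # Match: record the sum, then advance both files
            coin.append(int(e1[1]) + int(e2[1]))
            e1, e2 = next_entry(f1), next_entry(f2)
        elif e1[0] < e2[0]:
            e1 = next_entry(f1)  # advance the file with the smaller key
        else:
            e2 = next_entry(f2)

This way neither file is ever held in memory in full, so memory use stays constant regardless of file size.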