Search code examples
pythonloopsparsinglarge-data

Finding identical numbers in large files python


I have two data files in python, each containing two-column data as below:

3023084 5764
9152549 5812
18461998 5808
45553152 5808
74141469 5753
106932238 5830
112230478 5795
135207137 5800
148813978 5802
154818883 5798

There are about 10M entries in each file (~400Mb).

I have to sort through each file and check if any number in the first column of one file matches any number in the first column in another file.

The code I currently have converted the files to lists:

ch1 = []
with open('ch1.txt', 'r+') as file: 
    for line in file: 
        if ':' not in line:
            line = line.split() 
            ch1.append([line[0], line[1]])

ch2 = []
with open('ch2.txt', 'r+') as file: 
    for line in file: 
        if ':' not in line:
            line = line.split() 
            ch2.append([line[0], line[1]])

I then iterate through both of the lists looking for a match. When a match is found I with to add the sum of the right hand columns to a new list 'coin'

coin = []
for item1 in ch1: 
    for item2 in ch2: 
        if item1[0] == item2[0]:
            coin.append(int(item1[1]) + int(item2[1]))

The issue is this is taking a very long time and or crashing. Is there a more efficient way of running with?


Solution

  • There are lots of ways to improve this; for example:

    • Since you only scan through the contents of ch1.txt once, you don't need to read it into a list, and should thus take up less memory, but probably won't speed things up all that much.

    • If you sort each of your lists, you can check for matches much more efficiently. Something like:

        i1, i2 = 0, 0
        while i1 < len(ch1) and i2 < len(ch2):
            if ch1[i1][0] == ch2[i2][0]:
                # Do what you do for matches
                ...
                # Advance both indices
                i1 += 1
                i2 += 1
            elif ch1[i1][0] < ch2[i2][0]:
                # Advance index of the smaller value
                i1 += 1
            else: # ch1[i1][0] > ch2[i2][0]
                i2 += 1
    

    If the data in the files are already sorted, you can combine both ideas: instead of advancing an index, you simply read in the next line of the corresponding file. This should improve efficiency in time and space.