I have a reference file that is about 9,000 lines and has the following structure: (index, size) - where index is unique but size may not be.
0 193532
1 10508
2 13984
3 14296
4 12572
5 12652
6 13688
7 14256
8 230172
9 16076
And I have a data file that is about 650,000 lines and has the following structure: (cluster, offset, size) - where offset is unique but size is not.
446 0xdf6ad1 34572
447 0xdf8020 132484
451 0xe1871b 11044
451 0xe1b394 7404
451 0xe1d12b 5892
451 0xe1e99c 5692
452 0xe20092 6224
452 0xe21a4b 5428
452 0xe23029 5104
452 0xe2455e 138136
I need to compare each size value in the second column of the reference file against the size values in the third column of the data file. If there is a match, output the offset hex value (second column of the data file) together with the index value (first column of the reference file). Currently I am doing this with the following code and just piping the output to a new file:
#!/usr/bin/python3
import sys
ref_file = sys.argv[1]
dat_file = sys.argv[2]
with open(ref_file, 'r') as ref, open(dat_file, 'r') as dat:
    for r_line in ref:
        ref_size = r_line[r_line.find(' ') + 1:-1]
        for d_line in dat:
            dat_size = d_line[d_line.rfind(' ') + 1:-1]
            if dat_size == ref_size:
                print(d_line[d_line.find('0x') : d_line.rfind(' ')]
                      + '\t'
                      + r_line[:r_line.find(' ')])
        dat.seek(0)
The typical output looks like this:
0x86ece1eb 0
0x16ff4628f 0
0x59b358020 0
0x27dfa8cb4 1
0x6f98eb88f 1
0x102cb10d4 2
0x18e2450c8 2
0x1a7aeed12 2
0x6cbb89262 2
0x34c8ad5 3
0x1c25c33e5 3
This works fine but takes about 50 minutes to complete for the given file sizes.
It has done its job, but as a novice I am always keen to learn ways to improve my coding and share what I learn. My question is: what changes could I make to improve the performance of this code?
Since you look up lines in the files by size, these sizes should be the keys of a dictionary data structure. You need this dictionary to get rid of the nested loop, which is the real performance killer here. Furthermore, as your sizes are not unique, you will have to store lists of offset / index values (depending on which file you put in the dictionary). A defaultdict will help you avoid some clunky code:
from collections import defaultdict
import sys
ref_file = sys.argv[1]   # same command-line arguments as in your original script
dat_file = sys.argv[2]
with open(ref_file, 'r') as ref, open(dat_file, 'r') as dat:
    dat_dic = defaultdict(list)   # maintain a list of offsets for each size
    for d_line in dat:
        _, offset, size = d_line.split()
        dat_dic[size].append(offset)
    for r_line in ref:
        index, size = r_line.split()
        for offset in dat_dic[size]:
            # dict lookup is O(1), not the O(N) of looping over the dat_file
            print('{offset}\t{index}'.format(offset=offset, index=index))
If the order of your output lines does not matter, you can think about doing it the other way around: your dat_file is so much bigger than the reference file that building the defaultdict from it uses a lot more RAM than building it from the reference file would.
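A minimal sketch of that reversed approach, assuming the same file layouts as above (ref_dic is just a name introduced here for the dictionary keyed on the reference sizes):

from collections import defaultdict
import sys
ref_file = sys.argv[1]
dat_file = sys.argv[2]
with open(ref_file, 'r') as ref, open(dat_file, 'r') as dat:
    ref_dic = defaultdict(list)   # size -> list of indices, only ~9,000 entries
    for r_line in ref:
        index, size = r_line.split()
        ref_dic[size].append(index)
    for d_line in dat:
        _, offset, size = d_line.split()
        # .get avoids inserting an empty list for every unmatched size
        for index in ref_dic.get(size, ()):
            print('{offset}\t{index}'.format(offset=offset, index=index))

The output is then grouped by the data file's order rather than the reference file's, but the dictionary only ever holds the roughly 9,000 reference entries instead of the 650,000 data entries.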