text replace data-manipulation file-manipulation bigdata

Substitute occurrences of keys in a file with the corresponding values from another file

I have two big files: the first (10 GB) contains text with occurrences of keys in the format {keyX}, and the second (3 GB) contains the mapping between keys and their values (45 million entries).

file1:

Lorem ipsum {key1} sit amet, consectetur {key41736928} elit, ...

file2:

{key1} dolor
...
{key41736928} adipiscing
...

Given the size of the second file, I can't load all the key-value pairs into memory, but I also can't scan the entire second file for every occurrence of a key.

How can I substitute all the keys in the first file with the corresponding values from the second file in a reasonable amount of time?

Solution

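Use a binary search on the second file. It is ordered by key, so each lookup costs O(log n) reads, roughly log2(45 million) ≈ 26 probes, instead of a full scan. Note that the code compares keys as plain strings, so the file has to be sorted in that same lexicographic (byte) order, for example the order produced by `LC_ALL=C sort`, and not numerically by the number inside the key.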

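import os

# Path of the sorted key/value file (file2 in the question); placeholder value.
mid_name_file = 'file2'


def get_row_by_id(searched_row_id):
    low = 0
    high = os.path.getsize(mid_name_file)

    # Binary mode gives exact byte offsets and allows relative seeks in Python 3.
    with open(mid_name_file, 'rb') as f:
        while low < high:
            f.seek((low + high) // 2, 0)  # absolute position
            seek_to(f, b'\n')  # rewind to the start of the row
            row_start = f.tell()
            row = parse_row(f.readline())
            row_id = row[0]

            if row_id == searched_row_id:
                return row[1]
            elif searched_row_id < row_id:
                high = row_start  # keep only the rows before this one
            else:
                low = f.tell()  # keep only the rows after this one

    # The search interval is empty: the key is not in the file.
    raise ValueError(searched_row_id)


def seek_to(f, c):
    # Step backwards until the previous byte is c or the file start is reached.
    while f.tell() > 0:
        f.seek(-1, 1)
        if f.read(1) == c:
            break
        f.seek(-1, 1)


def parse_row(row):
    # Rows are tab-separated; return the key and the whole decoded row.
    row = row.decode('utf-8').rstrip('\n')
    return row.split('\t')[0], row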
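
To tie the lookup back to the question, the first file can be streamed line by line and every key replaced through a lookup function, so that neither file has to fit in memory. The following is only a sketch under assumptions: the names `substitute_keys`, `KEY_PATTERN` and the `lookup` parameter are hypothetical, keys are assumed to match `\{key\d+\}`, and `lookup` is assumed to return the replacement value or raise `KeyError`/`ValueError` for an unknown key (which is then left unchanged).

```python
import re

# Assumed key format, taken from the question: {key<digits>}
KEY_PATTERN = re.compile(r'\{key\d+\}')


def substitute_keys(lines, lookup):
    """Yield each input line with every {keyN} replaced by lookup('{keyN}').

    `lookup` is any callable mapping a key to its value; keys it cannot
    resolve (KeyError or ValueError) are left untouched.
    """
    def replace(match):
        key = match.group(0)
        try:
            return lookup(key)
        except (KeyError, ValueError):
            return key

    # Lines are processed lazily, one at a time.
    for line in lines:
        yield KEY_PATTERN.sub(replace, line)


if __name__ == '__main__':
    # Tiny in-memory stand-in for the key/value file.
    mapping = {'{key1}': 'dolor', '{key41736928}': 'adipiscing'}
    text = ['Lorem ipsum {key1} sit amet, consectetur {key41736928} elit\n']
    for line in substitute_keys(text, mapping.__getitem__):
        print(line, end='')
```

With the binary search above, `lookup` could be something like `lambda key: get_row_by_id(key).rstrip('\n').split('\t', 1)[1]`, and `lines` an open handle on the first file, with the results written out via `writelines`. If the same keys occur many times, wrapping the lookup in `functools.lru_cache` may save repeated disk seeks.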