
Substitute occurrences of keys in a file with the corresponding values from another file


I have two big files: the first (10 GB) contains text with occurrences of keys in the format {keyX}, and the second (3 GB) contains the mapping between keys and their values (45 million entries).

file1:

Lorem ipsum {key1} sit amet, consectetur {key41736928} elit, ...

file2:

{key1} dolor
...
{key41736928} adipiscing
...

Given the size of the second file, I can't load all the key-value pairs into memory, but I also can't scan the entire second file for every occurrence of a key.

How can I substitute all the keys in the first file with the corresponding values from the second file in a reasonable amount of time?


Solution

  • Use a binary search on the second file. It is sorted by key, so the best you can do is an O(log n) search per key: roughly 30 seeks for a 3 GB file, instead of a full scan.

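    import os

    # Placeholder path to the key/value file, which must be sorted by key
    # in plain byte order (the order Python uses to compare strings).
    mid_name_file = 'file2'


    def get_row_by_id(searched_row_id):
        # Binary search over byte offsets. Invariant: the searched row,
        # if present, starts at an offset in [low, high).
        low, high = 0, os.path.getsize(mid_name_file)

        # Binary mode: text-mode files do not allow relative seeks in Python 3.
        with open(mid_name_file, 'rb') as f:
            while low < high:
                f.seek((low + high) // 2, 0)  # absolute position
                row_start = seek_to(f, b'\n')
                row = parse_row(f.readline())
                row_id = row[0]

                if row_id == searched_row_id:
                    return row[1]
                elif searched_row_id < row_id:
                    high = row_start  # continue in the rows before this one
                else:
                    low = f.tell()  # continue in the rows after this one

        raise ValueError(searched_row_id)  # key not in the file


    def seek_to(f, c):
        # Move back to just after the previous `c` (or to the file start)
        # and return that offset, i.e. the start of the current row.
        pos = f.tell()
        while pos > 0:
            f.seek(pos - 1, 0)
            if f.read(1) == c:
                break
            pos -= 1
        f.seek(pos, 0)
        return pos


    def parse_row(row):
        # Return (key, whole row); the key is the first tab-separated field.
        row = row.decode('utf-8').rstrip('\n')
        return row.split('\t')[0], row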
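
The lookup is only half of the task; the first file still has to be rewritten. Below is a minimal sketch of that step, assuming the keys always look like `{key` + digits + `}` (as in the question's examples) and that the first file can be streamed line by line. The names `KEY_PATTERN` and `substitute_keys` are hypothetical, and `lookup` stands for any function that maps a key to its value; here a small dict stands in for the on-disk search so the example runs on its own.

```python
import io
import re

# Assumed key format, based on the examples {key1} and {key41736928}.
KEY_PATTERN = re.compile(r'\{key\d+\}')


def substitute_keys(src, dst, lookup):
    """Copy src to dst line by line, replacing each key with lookup(key).

    `lookup` is any callable mapping a key string to its value. Keys it
    cannot resolve (it raises KeyError or ValueError) are left unchanged.
    """
    def replace(match):
        key = match.group(0)
        try:
            return lookup(key)
        except (KeyError, ValueError):
            return key

    # Only one line of the big file is held in memory at a time.
    for line in src:
        dst.write(KEY_PATTERN.sub(replace, line))


# Tiny in-memory demonstration; a dict stands in for the on-disk lookup.
mapping = {'{key1}': 'dolor', '{key41736928}': 'adipiscing'}
src = io.StringIO('Lorem ipsum {key1} sit amet, consectetur {key41736928} elit\n')
dst = io.StringIO()
substitute_keys(src, dst, mapping.__getitem__)
print(dst.getvalue(), end='')
```

With the real files, `src` and `dst` would be the opened input and output files, and `lookup` would be a small function that calls `get_row_by_id` and extracts the value field from the row it returns. Every key occurrence then costs one binary search, so whether this is fast enough depends on how many occurrences the first file contains.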