I have two big files: the first (10 GB) contains text with occurrences of keys in a specific format, {keyX},
and the second (3 GB) contains the mapping between keys and their values (45 million entries).
file1:
Lorem ipsum {key1} sit amet, consectetur {key41736928} elit, ...
file2:
{key1} dolor
...
{key41736928} adipiscing
...
Given the size of the second file I can't load all the key-value pairs into memory, but I also can't scan the entire second file for every occurrence of a key.
How can I substitute every key in the first file with its corresponding value from the second file in a reasonable amount of time?
Use a binary search in the second file. It is ordered by key, so the best you can do is an O(log n) search.
import os

def get_row_by_id(searched_row_id, mid_name_file):
    # Binary search over the sorted key/value file: O(log n) seeks per lookup.
    step = os.path.getsize(mid_name_file) / 2.0  # current probe offset
    step_dimension = step                        # halved after every probe
    last_row_id = None
    with open(mid_name_file, 'rb') as f:  # binary mode: text mode forbids relative seeks
        while True:
            f.seek(int(step), 0)   # absolute position
            seek_to(f, b'\n')      # back up to the start of a row
            row = parse_row(f.readline().decode())
            row_id = row[0]
            if row_id == last_row_id:
                # Probed the same row twice in a row: the key is not in the file.
                raise ValueError(searched_row_id)
            last_row_id = row_id
            if row_id == searched_row_id:
                return row[1]
            elif searched_row_id < row_id:
                # Lexicographic comparison: the file must be sorted the same way
                # (e.g. produced with LC_ALL=C sort).
                step_dimension /= 2.0
                step = step - step_dimension
            else:
                step_dimension /= 2.0
                step = step + step_dimension

def seek_to(f, c):
    # Scan backwards until just past the previous occurrence of c
    # (or the start of the file), i.e. to the beginning of a row.
    while f.tell() > 0 and f.read(1) != c:
        f.seek(-2, 1)

def parse_row(row):
    # A row looks like '{key41736928} adipiscing': key, one space, value.
    key, _, value = row.rstrip('\n').partition(' ')
    return key, value
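With the lookup in place, the substitution itself can be done by streaming the first file and replacing each {keyX} occurrence as it is encountered, so the 10 GB file is never held in memory. Here is a minimal sketch under that assumption; the file names and the {key\d+} pattern are placeholders taken from the examples in the question:

import re
from functools import lru_cache

KEY_PATTERN = re.compile(r'\{key\d+\}')  # matches occurrences such as {key41736928}

@lru_cache(maxsize=1_000_000)  # repeated keys are resolved only once
def lookup(key):
    return get_row_by_id(key, 'file2.txt')

with open('file1.txt', 'r') as src, open('output.txt', 'w') as dst:
    for line in src:
        dst.write(KEY_PATTERN.sub(lambda m: lookup(m.group(0)), line))

Each lookup costs roughly log2(45 million) ≈ 26 seeks, so caching pays off quickly when the same key appears many times. If file1 has lines too long to read whole, process it in fixed-size chunks instead, taking care not to split a {keyX} token across chunk boundaries.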