I have two files that I want to compare. Both are pretty long, but their basic structure is this:
trigger_file_org = '''
10.792001 283292 30
11.286001 296136 9
11.792001 309292 130
17.898001 468048 23
18.390001 480840 9
18.896001 493996 123
24.988001 652388 73
25.482001 665232 9
25.988001 678388 173
34.026002 887376 10
34.518002 900168 9
35.024002 913324 110
40.676002 1060276 82
41.170002 1073120 9
41.676002 1086276 182
48.994002 1276544 43
49.488002 1289388 9
49.994002 1302544 143
56.032003 1459532 30
56.524003 1472324 9
57.032003 1485532 130
'''
trigger_file = trigger_file_org.strip().splitlines()  # the string stands in for the real file
new_scenario_org = '''
30 7503
23 6412
73 1307
10 3901
82 4118
43 7404
30 3403
'''
scenario = new_scenario_org.strip().splitlines()  # the string stands in for the real file
Now, the order of the two-digit codes in the first column of the scenario file is the same as the order of the two-digit codes in the last column of the trigger file (30 -> 23 -> 73 -> 10 -> 82 -> 43 -> 30), but in the trigger file there are other numbers in between, and the distance is not always the same. Moreover, the two-digit codes will repeat eventually, so they do not identify a row uniquely.
What I want to do is walk through the lines of the two files from top to bottom, and whenever a two-digit code in the trigger file matches the next code in the scenario file, I want the four-digit code from the scenario file to be attached to that line, like this:
10.792001 283292 30 7503
11.286001 296136 9
11.792001 309292 130
17.898001 468048 23 6412
18.390001 480840 9
18.896001 493996 123
24.988001 652388 73 1307
25.482001 665232 9
25.988001 678388 173
34.026002 887376 10 3901
34.518002 900168 9
35.024002 913324 110
40.676002 1060276 82 4118
41.170002 1073120 9
41.676002 1086276 182
48.994002 1276544 43 7404
49.488002 1289388 9
49.994002 1302544 143
56.032003 1459532 30 3403
56.524003 1472324 9
57.032003 1485532 130
So far the code I have is:
import re

iterations = 0
trig_item_count = 0
for trig_i in range(len(trigger_file)):
    curr_trigger_line = str.split(trigger_file[trig_i])
    #print(curr_trigger_line)
    if re.match('^[1-9][0-9]$', curr_trigger_line[2]):
        trig_item_count = trig_item_count + 1
        for sce_i in range(len(scenario)):
            iterations = iterations + 1 # this is 129 600 total in the end bc it iterates through the trigger file and then the scenario file
            curr_sce_line = str.split(scenario[sce_i])
            if curr_trigger_line[2] == curr_sce_line[0]:
                line_where_match__was_found = trig_i
                if trig_i > line_where_match__was_found:
                    print("Hurray")
This code finds all the occurrences of the two-digit code, but it iterates through the entire scenario file every time. I understand why this is wrong, but I don't know how to tell Python to search from top to bottom and to skip the scenario entries that have already been matched.
Any help is greatly appreciated!
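One way to read the requirement is to keep a single position in the scenario list and advance it only after a match, so entries that were already used are never looked at again. The following is a minimal sketch under that assumption; the names `annotate`, `TWO_DIGIT` and `sce_pos` are made up for illustration, and it assumes every two-digit trigger line has a scenario line waiting for it in the same order.

```python
import re

TWO_DIGIT = re.compile(r'[1-9][0-9]')  # same pattern as in the question


def annotate(trigger_lines, scenario_lines):
    """Append scenario values to trigger lines whose third field is a two-digit code."""
    sce_pos = 0  # index of the next unused scenario line
    result = []
    for line in trigger_lines:
        line = line.rstrip()
        fields = line.split()
        if len(fields) >= 3 and TWO_DIGIT.fullmatch(fields[2]):
            if sce_pos >= len(scenario_lines):
                raise ValueError('scenario ended too soon')
            code, value = scenario_lines[sce_pos].split()
            if code != fields[2]:
                raise ValueError(f'expected code {fields[2]}, scenario has {code}')
            line = f'{line} {value}'
            sce_pos += 1  # this scenario line is used up; never revisit it
        result.append(line)
    return result


trigger_lines = [
    '10.792001 283292 30',
    '11.286001 296136 9',
    '11.792001 309292 130',
    '17.898001 468048 23',
]
scenario_lines = ['30 7503', '23 6412']
annotated = annotate(trigger_lines, scenario_lines)
print('\n'.join(annotated))
```

Because `sce_pos` only ever moves forward, each file is walked exactly once, instead of rescanning the scenario for every trigger line.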
trigger = '''\
10.792001 283292 30
11.286001 296136 9
11.792001 309292 130
17.898001 468048 23
18.390001 480840 9
18.896001 493996 123
24.988001 652388 73
25.482001 665232 9
25.988001 678388 173
34.026002 887376 10
34.518002 900168 9
35.024002 913324 110
40.676002 1060276 82
41.170002 1073120 9
41.676002 1086276 182
48.994002 1276544 43
49.488002 1289388 9
49.994002 1302544 143
56.032003 1459532 30
56.524003 1472324 9
57.032003 1485532 130
'''.splitlines()
scenario = '''\
30 7503
23 6412
73 1307
10 3901
82 4118
43 7404
30 3403
'''.splitlines()
it = iter(scenario)  # consumed one item at a time, so matched entries are never revisited
out = []
for line in trigger:
    # a two-digit code: space, then two non-space characters at the end of the line
    if line[-3] == ' ' and line[-2] != ' ':
        key = line[-2:]
        try:
            key2, extra = next(it).split()
        except StopIteration:
            raise ValueError('scenario ended too soon')
        if key2 != key:
            raise ValueError('scenario key does not match')
        line += ' ' + extra
    out.append(line)
for line in out:
    print(line)
# matches your desired output as given
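Since the real data lives in two long files, the same idea can also be written as a generator that reads both file objects lazily. This is only a sketch: the function name `annotate_stream` and the file names mentioned in the comment are hypothetical, `io.StringIO` stands in for the opened files, and the two-digit test is done on the third whitespace-separated field (as in the question's regex) rather than by slicing, so trailing whitespace does not matter.

```python
import io
import re


def annotate_stream(trigger_fh, scenario_fh):
    """Yield trigger lines, appending the next scenario value to two-digit-code lines."""
    scenario_iter = iter(scenario_fh)
    for raw in trigger_fh:
        line = raw.rstrip()
        fields = line.split()
        if len(fields) == 3 and re.fullmatch(r'[1-9][0-9]', fields[2]):
            # next() with a default avoids a StopIteration escaping the generator
            scenario_line = next(scenario_iter, None)
            if scenario_line is None:
                raise ValueError('scenario ended too soon')
            code, value = scenario_line.split()
            if code != fields[2]:
                raise ValueError('scenario key does not match')
            line = f'{line} {value}'
        yield line


# io.StringIO stands in for e.g. open('trigger.txt') and open('scenario.txt')
trigger_fh = io.StringIO(
    '10.792001 283292 30\n'
    '11.286001 296136 9\n'
    '56.032003 1459532 30\n'
)
scenario_fh = io.StringIO('30 7503\n30 3403\n')
streamed = list(annotate_stream(trigger_fh, scenario_fh))
for line in streamed:
    print(line)
```

The repeated code 30 is not a problem here, because each scenario line is consumed exactly once and in order.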