How to compare two files in Python when line indices differ and regular expression has multiple matches?


I have two files that I want to compare. Both are pretty long, but their basic structure is this:

trigger_file_org = '''
   10.792001    283292  30
   11.286001    296136   9
   11.792001    309292 130
   17.898001    468048  23
   18.390001    480840   9
   18.896001    493996 123
   24.988001    652388  73
   25.482001    665232   9
   25.988001    678388 173
   34.026002    887376  10
   34.518002    900168   9
   35.024002    913324 110
   40.676002   1060276  82
   41.170002   1073120   9
   41.676002   1086276 182
   48.994002   1276544  43
   49.488002   1289388   9
   49.994002   1302544 143
   56.032003   1459532  30
   56.524003   1472324   9
   57.032003   1485532 130
       '''
trigger_file = [line for line in trigger_file_org.splitlines() if line.strip()]

new_scenario_org = '''
30 7503
23 6412
73 1307
10 3901
82 4118
43 7404
30 3403
'''
scenario = [line for line in new_scenario_org.splitlines() if line.strip()]

Now, the order of the two-digit codes in the first column of the scenario file is the same as the order of the two-digit codes in the last column of the trigger file (30 -> 23 -> 73 -> 10 -> 82 -> 43 -> 30), but in the trigger file there are other numbers in between, and the distance is not always the same. Moreover, the two-digit codes will repeat eventually, so they do not identify a row uniquely.

What I want to do is walk through both files from top to bottom, and whenever the next two-digit code from the scenario file is found in the trigger file, I want the corresponding four-digit code from the scenario file to be appended to that trigger line, like this:

   10.792001    283292  30 7503
   11.286001    296136   9
   11.792001    309292 130
   17.898001    468048  23 6412
   18.390001    480840   9
   18.896001    493996 123
   24.988001    652388  73 1307
   25.482001    665232   9
   25.988001    678388 173
   34.026002    887376  10 3901
   34.518002    900168   9
   35.024002    913324 110
   40.676002   1060276  82 4118
   41.170002   1073120   9
   41.676002   1086276 182
   48.994002   1276544  43 7404
   49.488002   1289388   9
   49.994002   1302544 143
   56.032003   1459532  30 3403
   56.524003   1472324   9
   57.032003   1485532 130

So far the code I have is:

import re

iterations = 0
trig_item_count = 0

for trig_i in range(len(trigger_file)):
    curr_trigger_line = str.split(trigger_file[trig_i])
    #print(curr_trigger_line)
    if re.match('^[1-9][0-9]$', curr_trigger_line[2]):
        trig_item_count = trig_item_count + 1
        for sce_i in range(len(scenario)):
            iterations = iterations + 1   # this is 129 600 total in the end bc it iterates through the trigger file and then the scenario file
            curr_sce_line = str.split(scenario[sce_i])
            if curr_trigger_line[2] == curr_sce_line[0]:
                line_where_match__was_found = trig_i
                # never true as written: the value was assigned on the line above
                if trig_i > line_where_match__was_found:
                    print("Hurray")

This code finds all the occurrences of the two-digit code, but it iterates through the entire scenario file every time. I understand why this is wrong, but I don't know how to tell Python to search from top to bottom and to ignore the scenario entries that have already been matched.

Any help is greatly appreciated!


Solution

trigger = '''\
   10.792001    283292  30
   11.286001    296136   9
   11.792001    309292 130
   17.898001    468048  23
   18.390001    480840   9
   18.896001    493996 123
   24.988001    652388  73
   25.482001    665232   9
   25.988001    678388 173
   34.026002    887376  10
   34.518002    900168   9
   35.024002    913324 110
   40.676002   1060276  82
   41.170002   1073120   9
   41.676002   1086276 182
   48.994002   1276544  43
   49.488002   1289388   9
   49.994002   1302544 143
   56.032003   1459532  30
   56.524003   1472324   9
   57.032003   1485532 130
'''.splitlines()

scenario = '''\
30 7503
23 6412
73 1307
10 3901
82 4118
43 7404
30 3403
'''.splitlines()

# One iterator over the scenario: each next() call consumes one entry,
# so an entry that has been matched is never looked at again.
it = iter(scenario)

out = []
for line in trigger:
    # Two-digit code: third-from-last character is a space, second-from-last is not.
    if line[-3] == ' ' and line[-2] != ' ':
        key = line[-2:]
        try:
            key2, extra = next(it).split()
        except StopIteration:
            raise ValueError('scenario ended too soon')
        if key2 != key:
            raise ValueError('scenario key does not match')
        line += ' ' + extra
    out.append(line)

for line in out:
    print(line)
# matches your desired output as given
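
The code above detects a two-digit code by character position, which relies on the last column always being right-aligned in three characters. If that cannot be guaranteed, the same single-pass idea can be combined with the regex test from the question. The following is only a sketch under that assumption: `attach_codes` and `TWO_DIGIT` are names made up here for illustration, and the demo data is a shortened copy of the sample above.

```python
import re

# Hypothetical helper, not part of the original answer: same single-pass idea,
# but the two-digit code is recognised by a regex on the last column.
TWO_DIGIT = re.compile(r'[1-9][0-9]')


def attach_codes(trigger_lines, scenario_lines):
    """Append each scenario code to the next trigger line carrying the same key."""
    # Generator over the (key, code) pairs; blank scenario lines are skipped.
    pairs = (line.split() for line in scenario_lines if line.strip())
    out = []
    for line in trigger_lines:
        fields = line.split()
        # fullmatch: the whole last field must be exactly two digits (not 9, not 130).
        if fields and TWO_DIGIT.fullmatch(fields[-1]):
            try:
                key, extra = next(pairs)  # consume exactly one scenario entry
            except StopIteration:
                raise ValueError('scenario ended too soon') from None
            if key != fields[-1]:
                raise ValueError(f'expected {key!r}, found {fields[-1]!r}')
            line = f'{line.rstrip()} {extra}'
        out.append(line)
    return out


demo_trigger = [
    '   10.792001    283292  30',
    '   11.286001    296136   9',
    '   11.792001    309292 130',
    '   17.898001    468048  23',
]
demo_scenario = ['30 7503', '23 6412']

for row in attach_codes(demo_trigger, demo_scenario):
    print(row)
```

Because the scenario is consumed through a single iterator, each trigger line costs at most one `next()` call, so the work grows with the length of the trigger file rather than with the product of the two file lengths. Whether to raise on a mismatch, as here, or to skip the line instead is a design choice that depends on how strictly the two files are expected to agree.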