Search code examples
pythonalgorithmlevenshtein-distancefuzzy-comparisonlcs

How can I find the best fit subsequences of a large string?


Say I have one large string and an array of substrings that when joined equal the large string (with small differences).

For example (note the subtle differences between the strings):

large_str = "hello, this is a long string, that may be made up of multiple
 substrings that approximately match the original string"

sub_strs = ["hello, ths is a lng strin", ", that ay be mad up of multiple",
 "subsrings tat aproimately ", "match the orginal strng"]

How can I best align the strings to produce a new set of sub strings from the original large_str? For example:

["hello, this is a long string", ", that may be made up of multiple",
 "substrings that approximately ", "match the original string"]

Additional Info

The use case for this is to find the page breaks of the original text from the existing page breaks of text extracted from a PDF document. Text extracted from the PDF is OCR'd and has small errors compared to the original text, but the original text does not have page breaks. The goal is to accurately page break the original text avoiding the OCR errors of the PDF text.


Solution

    1. Concatenate the substrings
    2. Align the concatenation with the original string
    3. Keep track of which positions in the original string are aligned with the boundaries between the substrings
    4. Split the original string on the positions aligned with those boundaries

    An implementation using Python's difflib:

    from difflib import SequenceMatcher
    from itertools import accumulate
    
    large_str = "hello, this is a long string, that may be made up of multiple substrings that approximately match the original string"
    
    sub_strs = [
      "hello, ths is a lng strin",
      ", that ay be mad up of multiple",
      "subsrings tat aproimately ",
      "match the orginal strng"]
    
    sub_str_boundaries = list(accumulate(len(s) for s in sub_strs))
    
    sequence_matcher = SequenceMatcher(None, large_str, ''.join(sub_strs), autojunk = False)
    
    match_index = 0
    matches = [''] * len(sub_strs)
    
    for tag, i1, i2, j1, j2 in sequence_matcher.get_opcodes():
      if tag == 'delete' or tag == 'insert' or tag == 'replace':
        matches[match_index] += large_str[i1:i2]
        while j1 < j2:
          submatch_len = min(sub_str_boundaries[match_index], j2) - j1
          while submatch_len == 0:
            match_index += 1
            submatch_len = min(sub_str_boundaries[match_index], j2) - j1
          j1 += submatch_len
      else:
        while j1 < j2:
          submatch_len = min(sub_str_boundaries[match_index], j2) - j1
          while submatch_len == 0:
            match_index += 1
            submatch_len = min(sub_str_boundaries[match_index], j2) - j1
          matches[match_index] += large_str[i1:i1+submatch_len]
          j1 += submatch_len
          i1 += submatch_len
    
    print(matches)
    

    Output:

    ['hello, this is a long string', 
     ', that may be made up of multiple ', 
     'substrings that approximately ', 
     'match the original string']