python algorithm levenshtein-distance fuzzy-comparison lcs

How can I find the best fit subsequences of a large string?

Say I have one large string and an array of substrings that when joined equal the large string (with small differences).

For example (note the subtle differences between the strings):

large_str = "hello, this is a long string, that may be made up of multiple
 substrings that approximately match the original string"

sub_strs = ["hello, ths is a lng strin", ", that ay be mad up of multiple",
 "subsrings tat aproimately ", "match the orginal strng"]

How can I best align the strings to produce a new set of sub strings from the original large_str? For example:

["hello, this is a long string", ", that may be made up of multiple",
 "substrings that approximately ", "match the original string"]

Additional Info

The use case for this is to find the page breaks of the original text from the existing page breaks of text extracted from a PDF document. Text extracted from the PDF is OCR'd and has small errors compared to the original text, but the original text does not have page breaks. The goal is to accurately page break the original text avoiding the OCR errors of the PDF text.

Solution

Concatenate the substrings
Align the concatenation with the original string
Keep track of which positions in the original string are aligned with the boundaries between the substrings
Split the original string on the positions aligned with those boundaries

An implementation using Python's difflib:

from difflib import SequenceMatcher
from itertools import accumulate

large_str = "hello, this is a long string, that may be made up of multiple substrings that approximately match the original string"

sub_strs = [
  "hello, ths is a lng strin",
  ", that ay be mad up of multiple",
  "subsrings tat aproimately ",
  "match the orginal strng"]

sub_str_boundaries = list(accumulate(len(s) for s in sub_strs))

sequence_matcher = SequenceMatcher(None, large_str, ''.join(sub_strs), autojunk = False)

match_index = 0
matches = [''] * len(sub_strs)

for tag, i1, i2, j1, j2 in sequence_matcher.get_opcodes():
  if tag == 'delete' or tag == 'insert' or tag == 'replace':
    matches[match_index] += large_str[i1:i2]
    while j1 < j2:
      submatch_len = min(sub_str_boundaries[match_index], j2) - j1
      while submatch_len == 0:
        match_index += 1
        submatch_len = min(sub_str_boundaries[match_index], j2) - j1
      j1 += submatch_len
  else:
    while j1 < j2:
      submatch_len = min(sub_str_boundaries[match_index], j2) - j1
      while submatch_len == 0:
        match_index += 1
        submatch_len = min(sub_str_boundaries[match_index], j2) - j1
      matches[match_index] += large_str[i1:i1+submatch_len]
      j1 += submatch_len
      i1 += submatch_len

print(matches)

Output:

['hello, this is a long string', 
 ', that may be made up of multiple ', 
 'substrings that approximately ', 
 'match the original string']