Search code examples
pythonstringpdfduplicatesextract

Python string: extract duplicated and randomly merged substring


My problem is when the original substring already has duplicate characters, my current function sometimes retains the wrong duplicate character.


Solution

  • You can think of the phenomenon as a duplicating string trailing behind the main string by some number of characters. Since the duplicating string at index i should always match the main string at index i, you can use this characteristic as a way to determine if an incoming character belongs to the duplicating string or the main string. To handle cases where the correct string has two of the same consecutive characters and the second of them is encountered, you can use a backtracking algorithm to find a path that leads to a matching duplicate string.

    Note that the duplicating characters are most likely a result of an emulation of a bold font, but since there's nothing visible in duplicating spaces they're unfortunately identified by the PDF scanner as one space. As a workaround you can artificially replace each space in the input with two spaces, and then replace any two consecutive spaces with one space for output. Note also that your input string may be missing a trailing space since the scanner is unable to identity one without a visible boundary, you can artificially append a trailing space to the input and strip trailing spaces afterwards:

    def dedupe(string):
        def _dedupe(pos, dupe):
            if pos == size:
                return dupe == half
            char = string[pos]
            if (dupe < len(output) and output[dupe] == char and
                    _dedupe(pos + 1, dupe + 1)):
                return True
            output.append(char)
            if _dedupe(pos + 1, dupe):
                return True
            output.pop()
    
        string = (string + ' ').replace(' ', '  ')
        size = len(string)
        half = size // 2
        output = []
        if _dedupe(0, 0):
            return ''.join(output).replace('  ', ' ').rstrip()
        raise ValueError('No matching duplicate found.')
    

    so that:

    print(dedupe(
    "CCoonnttiinnggeenntt CCoouuppoonn 16.01068.0%0 8p%.a .p (.Ma.o (nMtholyn)thly)"
    ))
    

    outputs:

    Contingent Coupon 16.008% p.a. (Monthly)
    

    Demo here