python, string-matching

Efficient way of finding potential substitutes in python string


Say I have a "ground truth" string such as ABA1234, and a "predicted" string that I want to compare against it, for example _ABA1234, and I have a set of acceptable substitutions, for example:

{
 "A": ["_A", "A"],
 "1": ["I", "1"],
}

What is the most efficient way of deciding whether or not the predicted string is equal to the target "ground truth" string with the given acceptable substitutions?

The brute-force method of generating candidates from the ground-truth string by applying the substitutions is exponential and not suitable for my purposes. Does anyone know a sub-exponential algorithm for this kind of problem? Can regex help here?


Solution

  • As I said in the comments, with the specific substitutions the OP gives as an example, there isn't much ambiguity. Consider a regex pattern based on the given truth and substitutions:

    >>> make_pattern('ABA1234', {'A': ['_A', 'A'], '1': ['I', '1']})
    re.compile(r'(?:_A|A)B(?:_A|A)(?:I|1)234', re.UNICODE)
    

    When the regex encounters '_A' in the test string, the corresponding possibilities at that point in the pattern are (?:_A|A), which is immediately decidable (no backtracking). When an 'I' is encountered, the pattern has (?:I|1) at that point, also decidable without backtracking. At the first non-match, the whole pattern is abandoned and the result is None.
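
    To make this concrete, here is a quick check with the make_pattern helper shown further down (the two test strings here are just my own illustrations):

    >>> pat = make_pattern('ABA1234', {'A': ['_A', 'A'], '1': ['I', '1']})
    >>> bool(pat.search('_ABA1234'))   # '_A' is accepted on the spot
    True
    >>> bool(pat.search('XBA1234'))    # no acceptable substitution fits, search returns None
    False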

    As @KellyBundy shows, it is easy to construct simple examples that will lead to much higher backtracking complexity. For example:

    {
        'A': ['A', 'AB'],
        'B': ['BC', 'C'],
    }
    

    This answer assumes that the "exponential complexity" the OP mentions refers to the size of the tree of possible candidates (all the strings that could match the truth given the substitutions; in the OP's example, that size is 8 (2**3) because there are 3 parts with two alternates each). But we don't have to explore that tree at all, as indicated above.
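
    For reference, that candidate tree can be enumerated directly; the brute-force sketch below (a hypothetical candidates helper, not part of the OP's code and not needed for the solution) only shows where the 2**3 comes from:

    from itertools import product

    def candidates(truth, substitutions):
        # One list of options per character of the truth; characters without
        # substitutions contribute themselves as the only option.
        options = [substitutions.get(c, [c]) for c in truth]
        return [''.join(combo) for combo in product(*options)]

    >>> len(candidates('ABA1234', {'A': ['_A', 'A'], '1': ['I', '1']}))
    8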

    If the complexity comes from ambiguous substitutions instead (as per Kelly's examples), then this answer won't help.
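
    To see that failure mode concretely, here is a small self-contained sketch (my own construction in the spirit of Kelly's example, not taken from it). It hand-builds the kind of pattern make_pattern would produce for the ambiguous substitutions above and times it against inputs that force a failed match:

    import re
    import time

    # Equivalent to make_pattern('AB' * n + 'D', {'A': ['A', 'AB'], 'B': ['BC', 'C']}).
    for n in range(10, 21, 2):
        pat = re.compile('(?:A|AB)(?:BC|C)' * n + 'D')
        test = 'ABC' * n + 'X'           # each 'ABC' parses two ways; the trailing 'X' forces failure
        t0 = time.perf_counter()
        assert pat.search(test) is None  # re tries on the order of 2**n parses before giving up
        print(n, f'{time.perf_counter() - t0:.4f} s')  # timings grow roughly geometrically with n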

    With these caveats, as the question currently stands and per OP's confirmation, simply compiling a regex pattern gives excellent performance:

    Edit: at the OP's request, we now allow an arbitrary prefix and suffix to be present in the test string.

    import re

    def make_pattern(truth, substitutions):
        # Each character of the truth becomes either a non-capturing alternation
        # of its allowed substitutes, or the escaped character itself.
        pat = ''.join([
            f"(?:{'|'.join([re.escape(s) for s in substitutions[c]])})"
            if c in substitutions else re.escape(c)
            for c in truth
        ])
        return re.compile(pat)
    

    Then:

    substitutions = {'A': ['_A', 'A'], '1': ['I', '1']}
    truth = 'ABA1234'
    
    pat = make_pattern(truth, substitutions)
    test = 'xxxx_ABA1234xxxxx'
    
    >>> bool(pat.search(test))
    True
    
    %timeit bool(pat.search(test))
    330 ns ± 0.0872 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
    

    Let's generate arbitrarily long random inputs:

    import random

    def gen(n, noise=.2):
        # Pool of building blocks: (char, substitutes) tuples from the global
        # `substitutions`, the substitute strings themselves, and extra literals.
        base = list(substitutions.items()) + [
            s for v in substitutions.values() for s in v
        ] + list('BCD02345_')
        parts = random.choices(base, k=n)  # may contain single elems and tuples
        # The truth takes the original character of each tuple; the test picks
        # one of its substitutes at random. Plain strings are copied verbatim.
        truth = ''.join([s[0] if isinstance(s, tuple) else s for s in parts])
        test = ''.join([
            random.choice(s[1]) if isinstance(s, tuple) else s for s in parts
        ])
        # add noise prefix and suffix
        a = [s[0] if isinstance(s, tuple) else s for s in base]
        prefix = ''.join(random.choices(a, k=round(n * noise)))
        suffix = ''.join(random.choices(a, k=round(n * noise)))
        test = f'{prefix}{test}{suffix}'
        return truth, test
    

    Examples:

    random.seed(0)  # for reproducibility
    n = 50
    truth, test = gen(n)
    
    >>> truth
    '43BACB3ICD5CI30A5_45I252C1B05_C4A4DA2142AC5AI5_ADA_'
    
    >>> test
    '_5_CII1I5A43BACB3ICD5CI30A5_45I252C1B05_C4A4D_A2142_AC5AI5_ADA_3DI13IA_A4A'
    
    pat = make_pattern(truth, substitutions)
    >>> bool(pat.search(test))
    True
    
    %timeit bool(pat.search(test))
    527 ns ± 0.174 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
    

    With n = 10_000 (so both truth and test will be no smaller than that):

    random.seed(0)  # for reproducibility
    n = 10_000
    truth, test = gen(n)
    
    >>> len(truth), len(test)
    (10645, 15238)
    
    pat = make_pattern(truth, substitutions)
    >>> bool(pat.search(test))
    True
    
    %timeit bool(pat.search(test))
    129 µs ± 21.6 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
    

    (Meanwhile, building the pattern itself is very predictable, since no backtracking is involved, and takes ~4 ms for a 10K-character truth string.)