Changing letters of a string to obtain maximum score

You are given a string and can change at most Q letters in the string. You are also given a list of substrings (each two characters long), with a corresponding score. Each occurance of the substring within the string adds to your total score. What is the maximum possible attainable score?

String length <= 150, Q <= 100, Number of Substrings <= 700

Example:

String = bpdcg

Q = 2

Substrings:

bz - score: 2

zd - score: 5

dm - score: 7

ng - score: 10

In this example, you can achieve the maximum score b changing the "p" in the string to a "z" and the "c" to an "n". Thus, your new string is "bzdng" which has a score of 2+5+10 = 17.

I know that given a string which already has the letters changed, the score can be checked in linear time using a dictionary matching algorithm such as aho-corasick (or with a slightly worse complexity, Rabin Karp). However, trying each two letter substitution will take too long and then checking will take too long.

Another possible method I thought was to work backwards, to construct the ideal string from the given substrings and then check whether it differs by at most two characters from the original string. However, I am not sure how to do this, and even if it could be done, I think that it would also take too long.

What is the best way to go about this?

Solution

An efficient way to solve this is to use dynamic programming.

Let L be the set of letters that start any of the length-2 scoring substrings, and a special letter "*" which stands for any other letter than these.

Let S(i, j, c) be the maximum score possible in the string (up to index i) using j substitutions, where the string ends with character c (where c in L).

The recurrence relations are a bit messy (or at least, I didn't find a particularly beautiful formulation of them), but here's some code that computes the largest score possible:

infinity = 100000000

def S1(L1, L2, s, i, j, c, scores, cache):
    key = (i, j, c)
    if key not in cache:
        if i == 0:
            if c != '*' and s[0] != c:
                v = 0 if j >= 1 else -infinity
            else:
                v = 0 if j >= 0 else -infinity
        else:
            v = -infinity
            for d in L1:
                for c2 in [c] if c != '*' else L2 + s[i]:
                    jdiff = 1 if s[i] != c2 else 0
                    score = S1(L1, L2, s, i-1, j-jdiff, d, scores, cache)
                    score += scores.get(d+c2 , 0)
                    v = max(v, score)
        cache[key] = v
    return cache[key]

def S(s, Q, scores):
    L1 = ''.join(sorted(set(w[0] for w in scores))) + '*'
    L2 = ''.join(sorted(set(w[1] for w in scores)))
    return S1(L1, L2, s + '.', len(s), Q, '.', scores, {})

print S('bpdcg', 2, {'bz': 2, 'zd': 5, 'dm': 7, 'ng': 10})

There's some room for optimisation:

the computation isn't terminated early if j goes negative
when given a choice, every value of L2 is tried, whereas only letters that can complete a scoring word from d need trying.

Overall, if there's k different letters in the scoring words, the algorithm runs in time O(QN*k^2). With the second optimisation above, this can be reduced to O(QNw) where w is the number of scoring words.