Search code examples
stringalgorithmdynamicgraphlanguage-agnostic

String manipulation with dynamic programming


I have a problem where I have a string of length N, where (1 ≤ N ≤ 10^5). This string will only have lower case letters.

We have to rewrite the string so that it has a series of "streaks", where the same letter is included at least K (1 ≤ K ≤ N) times in a row.

It costs a_ij to change a single specific letter in the string from i to j. There are M different possible letters you can change each letter to.

Example: "abcde" is the input string. N = 5 (length of "abcde"), M = 5 (letters are A, B, C, D, E), and K = 2 (each letter must be repeated at least 2 times) Then we are given a M×M matrix of values a_ij, where a_ij is an integer in the range 0…1000 and a_ii = 0 for all i.

0 1 4 4 4

2 0 4 4 4

6 5 0 3 2

5 5 5 0 4

3 7 0 5 0

Here, it costs 0 to change from A to A, 1 to change from A to B, 4 to change from A to C, and so on. It costs 2 to change from B to A.

The optimal solution in this example is to change the a into b, change the d into e, and then change both e’s into c’s. This will take 1 + 4 + 0 + 0 = 5 moves, and the final combo string will be "bbccc".

It becomes complicated as it might take less time to switch from using button i to an intermediate button k and then from button k to button j rather than from i to j directly (or more generally, there may be a path of changes starting with i and ending with j that gives the best overall cost for switching from button i ultimately to button j).

To solve for this issue, I am treating the matrix as a graph, and then performing Floyd Warshall to find the fastest time to switch letters. This will take O(M^3) which is only 26^3.

My next step is to perform dynamic programming on each additional letter to find the answer. If someone could give me advice on how to do this, I would be thankful!


Solution

  • Here are some untested ideas. I'm not sure if this is efficient enough (or completely worked out) but it looks like 26 * 3 * 10^5. The recurrence could be converted to a table, although with higher Ks, memoisation might be more efficient because of reduced state possibilities.

    Assume we've recorded 26 prefix arrays for conversion of the entire list to each of the characters using the best conversion schedule, using a path-finding method. This lets us calculate the cost of a conversion of a range in the string in O(1) time, using a function, cost.

    A letter in the result can be one of three things: either it's the kth instance of character c, or it's before the kth, or it's after the kth. This leads to a general recurrence:

    f(i, is_kth, c) ->
      cost(i - k + 1, i, c) + A
      where
        A = min(
          f(i - k, is_kth, c'),
          f(i - k, is_after_kth, c')
        ) forall c'
    

    A takes constant time since the alphabet is constant, assuming earlier calls to f have been tabled.

    f(i, is_before_kth, c) ->
      cost(i, i, c) + A
      where
        A = min(
          f(i - 1, is_before_kth, c),
          f(i - 1, is_kth, c'),
          f(i - 1, is_after_kth, c')
        ) forall c'
    

    Again A is constant time since the alphabet is constant.

    f(i, is_after_kth, c) ->
      cost(i, i, c) + A
      where
        A = min(
          f(i - 1, is_after_kth, c),
          f(i - 1, is_kth, c)
        )
    

    A is constant time in the latter. We would seek the best result of the recurrence applied to each character at the end of the string with either state is_kth or state is_after_kth.