In Place Run Length Encoding Algorithm

I encountered an interview question:

Given a input String: aaaaabcddddee, convert it to a5b1c1d4e2.

One extra constraint is, this needs to be done in-place, means no extra space(array) should be used.

It is guaranteed that the encoded string will always fit in the original string. In other words, string like abcde will not occur, since it will be encoded to a1b1c1d1e1 which occupies more space than the original string.

One hint interviewer gave me was to traverse the string once and find the space that is saved.

Still I am stuck as some times, without using extra variables, some values in the input string may be overwritten.

Any suggestions will be appreciated?

Solution

This is a good interview question.

Key Points

There are 2 key points:

Single character must be encoded as c1;
The encoded length will always be smaller than the original array.

Since 1, we know each character requires at least 2 places to be encoded. This is to say, only single character will require more spaces to be encoded.

Simple Approach

From the key points, we notice that the single character causes us a lot problem during the encoding, because they might not have enough place to hold the encoded string. So how about we leave them first, and compressed the other characters first?

For example, we encode aaaaabcddddee from the back while leaving the single character first, we will get:

aaaaabcddddee
_____a5bcd4e2

Then we could safely start from the beginning and encoding the partly encoded sequence, given the key point 2 such that there will be enough spaces.

Analysis

Seems like we've got a solution, are we done? No. Consider this string:

aaa3dd11ee4ff666

The problem doesn't limit the range of characters, so we could use digit as well. In this case, if we still use the same approach, we will get this:

aaa3dd11ee4ff666
__a33d212e24f263

Ok, now tell me, how do you distinguish the run-length from those numbers in the original string?

Well, we need to try something else.

Let's define Encode Benefit (E) as: the length difference between the encoded sequence and the original consecutive character sequence..

For example, aa has E = 0, since aa will be encoded to a2, and they have no length difference; aaa has E = 1, since it will be encoded as a3, and the length difference between the encoded and the original is 1. Let's look at the single character case, what's its E? Yes, it's -1. From the definition, we could deduce the formula for E: E = ori_len - encoded_len.

Now let's go back to the problem. From key point 2, we know the encoded string will always be shorter than the original one. How do we use E to rephrase this key point?

Very simple: sigma(E_i) >= 0, where E_i is the Encode Benefit of the i^th consecutive character substring.

For example, the sample you gave in your problem: aaaaabcddddee, can be broken down into 5 parts:

E(0) = 5 - 2 = 3  // aaaaa -> a5
E(1) = 1 - 2 = -1 // b -> b1
E(2) = 1 - 2 = -1 // c -> c1
E(3) = 4 - 2 = 2  // dddd -> d4
E(4) = 2 - 2 = 0  // ee -> e2

And the sigma will be: 3 + (-1) + (-1) + 2 + 0 = 3 > 0. This means there will be 3 spaces left after encoding.

However, from this example, we could see a potential problem: since we are doing summing, even if the final answer is bigger than 0, it's possible to get some negatives in the middle!

Yes, this is a problem, and it's quite serious. If we get E falls below 0, this means we do not have enough space to encode the current character and will overwrite some characters after it.

But but but, why do we need to sum it from the first group? Why can't we start summing from somewhere in the middle to skip the negative part? Let's look at an example:

2 0 -1 -1 -1 1 3 -1

If we sum up from the beginning, we will fall below 0 after adding the third -1 at index 4 (0-based); if we sum up from index 5, loop back to index 0 when we reach the end, we have no problem.

Algorithm

The analysis gives us an insight on the algorithm:

Start from the beginning, calculate E of the current consecutive group, and add to the total E_total;
If E_total is still non-negative (>= 0), we are fine and we could safely proceed to the next group;
If the E_total falls below 0, we need to start over from the current position, i.e. clear E_total and proceed to the next position.

If we reach the end of the sequence and E_total is still non-negative, the last starting point is a good start! This step takes O(n) time. Usually we need to loop back and check again, but since key point 2, we will definitely have a valid answer, so we could safely stop here.

Then we could go back to the starting point and start traditional run-length encoding, after we reach the end we need to go back to the beginning of the sequence to finish the first part. The tricky part is, we need to make use the remaining spaces at the end of the string. After that, we need to do some shifting just in case we have some order issues, and remove any extra white spaces, then we are finally done :)

Therefore, we have a solution (the code is just a pseudo and hasn't been verified):

// find the position first
i = j = E_total = pos = 0;
while (i < s.length) {
    while (s[i] == s[j]) j ++;
    E_total += calculate_encode_benefit(i, j);
    if (E_total < 0) {
        E_total = 0;
        pos = j;
    }
    i = j;
}

// do run length encoding as usual:
// start from pos, end with len(s) - 1, the first available place is pos
int last_available_pos = runlength(s, pos, len(s)-1, pos);
// a tricky part here is to make use of the remaining spaces from the end!!!
int fin_pos = runlength(s, 0, pos-1, last_available_pos);
// eliminate the white
eliminate(s, fin_pos, pos);
// update last_available_pos because of elimination
last_available_pos -= pos - fin_pos < 0 ? 0 : pos - fin_pos;
// rotate back
rotate(s, last_available_pos);

Complexity

We have 4 parts in the algorithm:

Find the starting place: O(n)
Run-Length-Encoding on the whole string: O(n)
White space elimination: O(n)
In place string rotation: O(n)

Therefore we have O(n) in total.

Visualization

Suppose we need to encode this string: abccdddefggggghhhhh

First step, we need to find the starting position:

Group 1: a     -> E_total += -1 -> E_total = -1 < 0 -> E_total = 0, pos = 1;
Group 2: b     -> E_total += -1 -> E_total = -1 < 0 -> E_total = 0, pos = 2;
Group 3: cc    -> E_total += 0  -> E_total = 0 >= 0 -> proceed;
Group 4: ddd   -> E_total += 1  -> E_total = 1 >= 0 -> proceed;
Group 5: e     -> E_total += -1 -> E_total = 0 >= 0 -> proceed;
Group 6: f     -> E_total += -1 -> E_total = -1 < 0 -> E_total = 0, pos = 9;
Group 7: ggggg -> E_total += 3  -> E_total = 3 >= 0 -> proceed;
Group 8: hhhhh -> E_total += 3  -> E_total = 6 >= 0 -> end;

So the start position will be 9:

         v this is the starting point
abccdddefggggghhhhh
abccdddefg5h5______
             ^ last_available_pos, we need to make use of these remaining spaces
abccdddefg5h5a1b1c2
d3e1f1___g5h5a1b1c2
      ^^^ remove the white space
d3e1f1g5h5a1b1c2
          ^ last_available_pos, rotate
a1b1c2d3e1f1g5h5

Last Words

This question is not trivial, and actually glued several traditional coding interview questions together naturally. A suggested mind flow would be:

observe the pattern and figure out the key points;
realize the reason for insufficient space is because of encoding single character;
quantize the benefit/cost of encoding on each consecutive characters group (a.k.a Encoding Benefit);
use the quantization you proposed to explain the original statement;
figure out the algorithm to find a good starting point;
figure out how to do run-length-encoding with a good starting point;
realize you need to rotate the encoded string and eliminate the white spaces;
figure out the algorithm to do in place string rotation;
figure out the algorithm to do in place white space elimination.

To be honest, it's a bit challenging for an interviewee to come up with a solid algorithm in a short time, so your analysis flow really matters. Don't say nothing, show your mind flow, this helps the interviewer to find out your current stage.