
For given sequences, build a graph which will contain a path for each sequence


Assume we have a few sequences:

cag

abg

ahcbg

afab

I want to generate the shortest possible sequence containing all of them as subsequences:

ahcfabg

I know that this can be achieved by building a directed acyclic graph which contains all alternative paths from those sequences:

   h        b
 /  \      /  \
a -> c -> a -> g
 \       /   
  ----> f

And then use a topological sort to generate the final sequence. But I have a problem with building the graph algorithmically.

The real-life problem is that I want to build a word clock. So, having two possible texts, e.g. "It's ten past two" and "It's twenty five to two", we can design such a graph:

It's -> twenty five to -> two
    \                   /
     --- ten past -----

But we can also "compress" it to

It's -> t -> w -> en -> ty five  -> t -> o -> two
                     \            /         
                       -> pas ->         

Again, after a topological sort we can obtain the final order of letters. It will then be possible to show both sentences: when displaying each sentence we light up only some of the letters (bolded):

  • **It's** **t**w**en**ty five **past**o **two**

  • **It's twenty five** pas**to** **two**

I'm thinking about an algorithmic way to find such a "compressed" graph for all possible hours written as words. I aim to use as few letters as possible, to simplify the clock display.

I have tried building a graph in which every letter is assigned a number giving its order of appearance in the sentence:

hhha -> (h,1) (h,2) (h,3) (a,1)
ahhha -> (a,1) (h,1) (h,2) (h,3) (a,2)
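As a side note, the numbering step itself is straightforward; here is a minimal sketch (the helper name number_letters is mine):

```python
def number_letters(text):
    # Assign each letter a number giving its order of appearance,
    # e.g. 'hhha' -> [('h', 1), ('h', 2), ('h', 3), ('a', 1)].
    counts = {}
    nodes = []
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
        nodes.append((ch, counts[ch]))
    return nodes
```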

Iterating through all possible hour texts, every pair of adjacent letters adds one edge to the graph. When adding a new edge I check whether the graph has become cyclic; if yes, a new node with an increased number is created and the edge is checked again.

hhha

(h,1) -> (h,2) -> (h,3) -> (a,1)

Then ahhha

Adding (a,1):

(h,1) -> (h,2) -> (h,3) -> (a,1)

Adding (h,1)

(h,1) -> (h,2) -> (h,3) -> (a,1)
 ^                          |
  ---------------------------

Cycle detected

Adding (h,2)

(h,1) -> (h,2) -> (h,3) -> (a,1)
           ^                 |
           ------------------

Cycle detected

Adding (h,3)

(h,1) -> (h,2) -> (h,3) -> (a,1)
                    ^        |
                      ---------

Cycle detected

Adding (h,4)

(h,1) -> (h,2) -> (h,3) -> (a,1)
                             \
                             (h,4)

Adding (h,2) as (h,5), after a few attempts with cycle detection

(h,1) -> (h,2) -> (h,3) -> (a,1)
                             \
                             (h,4) -> (h,5)

And similarly for (h,3) (as (h,6)) and (a,2):

(h,1) -> (h,2) -> (h,3) -> (a,1)
                             \
                             (h,4) -> (h,5) -> (h,6) -> (a,2)

This kind of works, but the resulting graph is not optimal: a topological sort gives

hhhahhha

instead of

ahhha

The problem is that all new nodes are simply appended after the first occurrences, instead of checking whether the sequence and the graph already share some common sub-paths. But checking every subsequence of the new text against all paths in the graph would be very complicated, and I want to find a better way.

I was also thinking about some probability calculation to better predict where a node should be merged, but I have no exact idea how to implement this. I could also find the longest common substring between every pair of strings to propose initial paths, but I don't believe this would give good results.


Solution

  • The problem with your approach is that you do not actually want to construct that graph. Its nodes can be various combinations of positions in all of the subsequences. If you have many subsequences, that number of combinations grows exponentially.

    Instead you want to find an optimal path through that graph without constructing the graph. This can be done with an A* search, which is based on a heap. Most languages have those, though you may find them called something like a priority queue.

    Note that we cannot always do that efficiently, because your problem (the shortest common supersequence problem) is NP-hard. There are a variety of approaches that produce guaranteed decent answers. But we're going to look for the optimum, and sometimes that actually won't be that hard to find.

    Our attempt will involve keeping track of many places that you might be, which requires data structures. The only sane way to keep track of so many similar data structures is to use some sort of persistent data structure. At a minimum I'll be using dictionaries, lists, and sets. It will also be helpful to maintain a fixed-length list, make it persistent, and make it hashable.

    There is no actual requirement for persistent data structures. It just makes things much more efficient if you have them. And it allows me to keep adding per node data structures without worrying about how big they are.

    All that said, we are starting with a list called subsequences. We can refer to each subsequence by index, subsequences[i], and each letter as subsequences[i][j].

    Now consider a subsequence for the string 'twenty five to'. We can build an array of counts of how many more of the same letter are needed later in that subsequence. So, for example, after the first t we need another 2 ts, so the first entry is 2. In this string that is [2, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0]. (Done by hand, might have a count off.) Do this for each subsequence and we wind up with still_need, a list of lists of counts. This makes it cheap to distinguish whether subsequences[i][j] still needs 5 more copies of that letter, or none.
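That precomputation is simple enough to sketch directly (plain Python; build_still_need is my name for it):

```python
def build_still_need(subsequences):
    # still_need[i][j] = how many more copies of subsequences[i][j]
    # occur after position j within subsequence i.
    return [[seq[j + 1:].count(ch) for j, ch in enumerate(seq)]
            for seq in subsequences]
```

Running it on 'twenty five to' reproduces the hand-computed array above.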

    Now, our position in the graph is a list of where we are in each subsequence. That is, for each i we set j = position[i], so the character we want is subsequences[i][j] and the number more we need of that character is still_need[i][j].

    There may be many ways to get to a given position. We will only want to keep track of the first path we find to it. Therefore we need a set of seen positions. We can do this very efficiently if we have a persistent hashable list of fixed length.
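In Python, plain tuples already behave like hashable, fixed-length persistent lists, so a minimal sketch of the seen check could look like this (the advance helper is my own illustration):

```python
# Positions as tuples: hashable, and "updating" one produces a new tuple
# while the old one stays valid for other search states.
position = (0, 0, 0, 0)
seen = set()
seen.add(position)

def advance(position, i):
    # Return a copy of position with index i advanced by one.
    return position[:i] + (position[i] + 1,) + position[i + 1:]

next_position = advance(position, 2)
```

A real persistent vector (with O(log n) updates) would be more efficient than tuple slicing, but tuples show the semantics.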

    Now the way our search structure will work is that it will be a heap of tuples. Let's assume a min-heap. We start the tuples with the things that we are comparing. That tuple will have:

    (least_steps, least_steps_left, counter, position, path, next_choice, remaining)
    

    Here is what each field means.

    least_steps is the length of the shortest possible sequence we could wind up with. It is a lower bound, and I'll go through the calculation later. Having this first means we're always given something that is potentially optimal.

    least_steps_left is the length of the shortest possible sequence from here to the end. Again, I'll go through the calculation later. This causes us to look at, of all potentially optimal answers, the one that is closest to being done.

    counter is just a counter of when this was added to the heap. It exists to force a tiebreak so that the heap will not compare the complex payloads that come next.

    position we discussed above. It is exactly where we are in all subsequences.

    path is either None or (letter, prev_path). Thus if the sequence was ['h', 'e', 'l', 'l', 'o'] then the path would be encoded as ('o', ('l', ('l', ('e', ('h', None))))). This may seem like a weird encoding, but a letter and a pointer take only O(1) space, and all the rest is shared with other nodes. This is a useful trick to know for things like getting the actual solution out of dynamic programming, instead of just the count of solutions.
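A quick sketch of building such a path and decoding it back, exactly as the search does on success:

```python
# Build the shared-tail encoding of ['h', 'e', 'l', 'l', 'o'].
path = None
for letter in 'hello':
    path = (letter, path)  # O(1) per step; earlier tails are shared

# Decode it back into a list by walking the chain and reversing.
decoded = []
while path is not None:
    decoded.append(path[0])
    path = path[1]
decoded.reverse()
```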

    next_choice is a dictionary mapping letter to the set of indexes of subsequences that want that letter next.
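For example, with the four subsequences from the question all at their starting positions, next_choice could be built with a plain dict-of-sets sketch like this:

```python
subsequences = ['cag', 'abg', 'ahcbg', 'afab']
position = (0, 0, 0, 0)

# Group subsequence indexes by the letter each one wants next.
next_choice = {}
for i, seq in enumerate(subsequences):
    next_choice.setdefault(seq[position[i]], set()).add(i)
```

Here subsequence 0 wants a c next, while 1, 2, and 3 all want an a.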

    remaining is a dictionary mapping letters to lists of sets of subsequence indexes. The idea is that remaining['t'][3] gives the set of subsequences that need a t and then will need 3 more after that. That's a mouthful, so suppose 'twenty five to' was subsequence 5, and we are at position 4, i.e. looking at the second t in twenty. Then remaining would contain

    {
        't': [set(), {5}],
        'y': [{5}],
        ' ': [set(), {5}],
        'f': [{5}],
        'i': [{5}],
        'v': [{5}],
        'e': [{5}],
        'o': [{5}],
    }
    

    The importance of remaining is that the sum of the lengths of its values is a lower bound on how many letters are still needed before the sequence can end. That sum is therefore least_steps_left, and least_steps is the number of steps to this point plus least_steps_left. While this could be calculated from scratch each time, it is more efficient to keep track of those counts and update them incrementally as each letter is emitted.
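Checking that claim against the example above (written with set() for the empty sets):

```python
# remaining for subsequence 5 ('twenty five to') at position 4.
remaining = {
    't': [set(), {5}],
    'y': [{5}],
    ' ': [set(), {5}],
    'f': [{5}],
    'i': [{5}],
    'v': [{5}],
    'e': [{5}],
    'o': [{5}],
}

# Sum of list lengths = lower bound on the letters still to emit.
least_steps_left = sum(len(v) for v in remaining.values())
```

The sum is 10, which is exactly the number of characters left in 'ty five to'.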

    With all that said, here is some untested code for the logic. I will be making method calls to an API for a persistent data structure that would still have to be implemented, but it shows the idea. Note that seen, subsequences, and still_need are NOT persistent data structures.

    while True:
        # Pop the most promising state. Note that we ignore the counter.
        least_steps, least_steps_left, _, position, path, next_choice, remaining = heapq.heappop(heap)
        if position in seen:
            continue  # Don't repeat work.
        elif position == final_position:
            # Decode the path and return it.
            answer = []
            while path is not None:
                answer.append(path[0])
                path = path[1]
            return list(reversed(answer))
        else:
            seen.add(position)
    
        # How many steps will we have taken after emitting one more letter?
        steps = least_steps - least_steps_left + 1
        for letter in next_choice.keys():
            next_path = (letter, path)
            # Persistent pop: a copy of the dict without letter, plus its value.
            next_next_choice, indices = next_choice.pop(letter)
            next_position = position
            next_remaining = remaining
            # Lower bound on steps left after this letter, adjusted as we go.
            est_left = least_steps_left
            for i in indices:
                # Subsequence i consumes one copy of letter: take it out of
                # the set at its current count.
                remain_by_letter = next_remaining[letter]
                cnt = still_need[i][next_position[i]]
                remain_remove_i = remain_by_letter[cnt].remove(i)
                if len(remain_by_letter) == cnt + 1 and 0 == len(remain_remove_i):
                    # The topmost set emptied, so the lower bound drops by one.
                    remain_by_letter = remain_by_letter.pop()
                    est_left -= 1
                else:
                    remain_by_letter = remain_by_letter.set(cnt, remain_remove_i)
                if 0 < cnt:
                    # i still needs more copies of letter, at one count less.
                    remain_by_letter = remain_by_letter.set(
                        cnt - 1, remain_by_letter[cnt - 1].add(i))
                next_remaining = next_remaining.set(letter, remain_by_letter)
    
                # Advance subsequence i and register its new next letter.
                next_position = next_position.set(i, next_position[i] + 1)
                j = next_position[i]
                if j < len(subsequences[i]):
                    next_letter = subsequences[i][j]
                    choice_by_letter = next_next_choice.get(
                        next_letter, PersistentSet()).add(i)
                    next_next_choice = next_next_choice.set(next_letter, choice_by_letter)
    
            # Now add this state to the queue.
            counter += 1
            heapq.heappush(heap, (steps + est_left, est_left, counter, next_position,
                                  next_path, next_next_choice, next_remaining))
    

    Note that once you have the optimal sequence, picking out the letters that match any given subsequence is straightforward. You can either solve it once for each subsequence, or on the fly.
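For instance, a greedy left-to-right match is enough, since every subsequence is guaranteed to be embedded in the final sequence (sketch; highlight is my name for it):

```python
def highlight(display, sentence):
    # Mark which positions of display light up to show sentence,
    # matching each letter of sentence greedily left to right.
    lit = [False] * len(display)
    j = 0
    for k, ch in enumerate(display):
        if j < len(sentence) and ch == sentence[j]:
            lit[k] = True
            j += 1
    return lit
```

On the question's example, the lit letters of 'ahcfabg' spell out each original sequence.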

    Also note that when this algorithm goes wrong, it goes very wrong: you'll run out of time and memory. A very simple way to fix that is to keep a count of how many times you've backtracked to each possible number of steps. If you backtrack too many times at some depth, just refuse to process further states there. This lets you find suboptimal solutions, but they'll still generally be pretty good.