find an optimal string(with minimal length) from a set of strings so that resulting string must contain anargram of every string

Suppose you have two strings. "abcde" and "dfbag". The resulting string can be "ceabdfg". The first five characters of the resulting string "ceadb" is the anargram of "abcde". The last five characters "abdfg" is also an anargram of second string(dfbag).

You can find the optimal string of two strings very easily.

Take the common set of characters all present in two strings. for the above example 'a', 'd' and 'b' are present in both two strings. So take those character in any order.(adb, dba, bda , etc.)
Take the remaining characters of first string and put them in front. 'c' and 'e' are the remaining characters for first string. So put them in front. so we found "ceabd". its an anargram of first string.
Take the remaining characters of second string and put them at the end of the string. 'f' and 'g' are remaining characters from second string. If we put them after the string found in 2, we get "ceadbfg" where "adbfg" is the anargram of the second string.

My question is what can be the algorithm to find the optimal resulting string if there are N number of strings?

Example:

We have 4 strings."abc", "ca", "bd", "d"

The optimal string can be "dbca".

Substring "bca" of "dbca" is the anargram of "abc"

Substring "ca" of "dbca" is the anargram of "ca"

Substring "db" of "dbca" is the anargram of "bd"

Substring "d" of "dbca" is the anargram of "d"

I am looking for an algorithm for this. (Ofcourse not the brute force one which checks the all combination :)). Without brute force, any idea?

Solution

This problem is NP-complete, and I show that by reducing HAMPATH (finding a Hamiltonian path in a graph) to it.

Note: I consider the alphabet being arbitrary large.

Consider a graph with N vertices and M edges where vertex V is incident to edges E_V1, E_V2, ..., E_Vk. We make a unique character for each edge. Now each vertex will correspond to a string E_V1 E_V2 ... E_Vk (that is, a string made of all characters for incident edges).

There are N strings of total length 2M. Note that each two strings have no more than two characters in common, because no two vertices share more than one edge. So the length of the superstring is at least 2M - (N - 1): it happens if each two subsequent strings share a character. Now note that this may happen only if there is an edge between any two subsequent vertices, that is, the graph has a Hamiltonian path.

Thus, if we check that there is a superstring of length at most 2M - (N - 1) in polynomial time then we solve HAMPATH in polynomial time. So your problem is NP-hard.

NP-completeness is obvious: a string itself is a certificate.