I have a data structure holding a graph like the one in the following picture:
In this tree, a node can have any number of unique children from the levels below it. The tree in the picture represents a set of paths, where every path begins with a node from Level 1 and ends with a node marked "*". So the paths of the tree in the picture are:
A then C then G
A then C then G then J
A then D then G
A then D then G then J
A then D then K, and so on...
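To make the path definition concrete, here is a minimal Python sketch that lists these paths with an explicit stack instead of recursion. The names `children`, `destinations` and `enumerate_paths`, and the adjacency for node A's part of the picture, are assumptions made for illustration (the edges are inferred from the paths listed above); this is not the actual data structure.

```python
# Hypothetical encoding of the part of the picture rooted at A.
children = {
    "A": ["C", "D"],
    "C": ["G"],
    "D": ["G", "K"],
    "G": ["J"],
    "J": [],
    "K": [],
}
destinations = {"G", "J", "K"}  # the nodes marked with "*"


def enumerate_paths(root, children, destinations):
    """Return every path from root that ends at a destination node."""
    paths = []
    stack = [(root, (root,))]  # (current node, path taken to reach it)
    while stack:
        node, path = stack.pop()
        if node in destinations:
            paths.append(path)
        # Push right to left so the leftmost child is expanded first.
        for child in reversed(children.get(node, [])):
            stack.append((child, path + (child,)))
    return paths


for path in enumerate_paths("A", children, destinations):
    print(" then ".join(path))
```

Each stack entry carries its own copy of the path, which is simple but is exactly the kind of duplication that becomes expensive with millions of sequences.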
Actually my original tree is huge (around 2 million sequences), and the maximum number of nodes per level is 61 (across 11 levels). This causes many memory consumption problems in my application (a computer vision application for SAMSUNG).
My target is to have an iterative algorithm that represents these paths in a more compact string format. I think the problem divides into three steps, as follows. I have built the tree data structure (step 2), but I still cannot derive an iterative algorithm that gets the output string/sequence of step 3 from the tree.
Step 1 is the flat list of sequences:
(A C G) | (A C G J) | (A D G) | (A D G J) | (A D K) | ....
where "|" represents alternatives.
Step 3 is the target output string:
(A (C G [J]) | (D (G [J]) | K)) | (B ....)
where "|" represents alternatives and "[ ]" encloses options. The target output string should be fully factored, so that no common factor remains that could be pulled out to simplify it further.
You can use a modification of iterative DFS, which utilizes a stack to keep track of unprocessed nodes. This algorithm never stores more than 6 characters on the stack* for any one node, and there are always fewer than N nodes on the stack (where N is the number of nodes in the graph). You've indicated that N will be at most 61*11=671, so there will be a maximum of about 4000 elements possible on the stack.
In the pseudocode below, a "destination" node is a starred node in the example above, e.g. G*.
Initialization:
A dummy node Φ is introduced with an edge from Φ to each of the "root" nodes, e.g. nodes A and B above. The token for Φ is assumed to be a non-printing character, or you can explicitly check before adding it to the output string. The node Φ is pushed onto the stack before calling the function.
    outString := ""
    while stack not empty
        pop token
        if token is node
            outString := outString + node(token) // Line 5 - explanation below
            if node(token) has children
                if node(token) is destination
                    outString := outString + "["
                    push "]"
                end
                if node(token) has multiple children
                    for each child of node(token), from right to left
                        push ")"
                        push child
                        push "("
                        push "|"
                    end
                    pop // remove last "|"
                else
                    push child
                end
            end
        else // token is ()[]|
            outString := outString + token
        end
    end
The output of this algorithm for the first part of your graph (A and its children) is (with extra spaces added for clarity; the spaces can be easily added to the code):
A (C G [J]) | (D (G [J]) | (K))
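As a rough Python sketch of the pseudocode above: the `children` and `destinations` structures and the function name `compact` are assumptions for illustration, the adjacency only covers A and its descendants as inferred from your listed paths, and the dummy node Φ is skipped by starting directly at A.

```python
# Hypothetical encoding of the part of the graph rooted at A.
children = {
    "A": ["C", "D"],
    "C": ["G"],
    "D": ["G", "K"],
    "G": ["J"],
    "J": [],
    "K": [],
}
destinations = {"G", "J", "K"}  # the starred nodes


def compact(root, children, destinations):
    """Iterative DFS that emits the factored string as a list of tokens."""
    out = []
    # Stack entries are tagged so a node is never confused with a symbol.
    stack = [("node", root)]
    while stack:
        kind, value = stack.pop()
        out.append(value)  # "Line 5" of the pseudocode for nodes
        if kind == "sym":
            continue
        kids = children.get(value, [])
        if not kids:
            continue
        if value in destinations:
            out.append("[")            # everything after this node is optional
            stack.append(("sym", "]"))
        if len(kids) > 1:
            for child in reversed(kids):   # right to left
                stack.append(("sym", ")"))
                stack.append(("node", child))
                stack.append(("sym", "("))
                stack.append(("sym", "|"))
            stack.pop()                # drop the "|" before the first alternative
        else:
            stack.append(("node", kids[0]))
    return " ".join(out)


print(compact("A", children, destinations))
```

This prints the same token sequence as the output shown above, except that every token is separated by a space.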
You'll notice a deviation between your result and mine: the final node K is enclosed in parentheses in my solution. If this is undesirable (it could result in ugliness like A[(B)|(C)]), you can eliminate it by performing an additional check when you pop a node token off of the stack, at the cost of some additional overhead. Simply replace Line 5 above with:
    if (node(token) has no children
        AND last character of outString is "("
        AND next token on stack is ")")
        remove trailing "(" from outString
        concatenate token to outString
        pop ")" from stack and ignore
    else
        outString := outString + node(token) // as above
    end
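The same check, sketched in Python under the same assumptions as before (hypothetical `children` and `destinations` structures, an illustrative function name, Φ omitted). It also assumes that no node is literally named "(", since the output list mixes node names and symbols.

```python
# Hypothetical encoding of the part of the graph rooted at A.
children = {
    "A": ["C", "D"],
    "C": ["G"],
    "D": ["G", "K"],
    "G": ["J"],
    "J": [],
    "K": [],
}
destinations = {"G", "J", "K"}  # the starred nodes


def compact_no_leaf_parens(root, children, destinations):
    """Like the basic traversal, but a childless alternative loses its ( )."""
    out = []
    stack = [("node", root)]
    while stack:
        kind, value = stack.pop()
        if kind == "sym":
            out.append(value)
            continue
        kids = children.get(value, [])
        # Replacement for "Line 5": a leaf that would be written as "(X)".
        if (not kids and out and out[-1] == "("
                and stack and stack[-1] == ("sym", ")")):
            out.pop()     # remove the trailing "("
            stack.pop()   # discard the matching ")"
        out.append(value)
        if not kids:
            continue
        if value in destinations:
            out.append("[")
            stack.append(("sym", "]"))
        if len(kids) > 1:
            for child in reversed(kids):   # right to left
                stack.append(("sym", ")"))
                stack.append(("node", child))
                stack.append(("sym", "("))
                stack.append(("sym", "|"))
            stack.pop()                # drop the "|" before the first alternative
        else:
            stack.append(("node", kids[0]))
    return " ".join(out)


print(compact_no_leaf_parens("A", children, destinations))
```

With this variant K is emitted without its own parentheses, and a destination node A with two leaf children B and C comes out as A [ B | C ] rather than A [ ( B ) | ( C ) ].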
Let me know if you have any questions or I've missed anything.
* This will happen in the (probably highly unlikely) case of a node being written as |[(A)]. Most nodes will take up 4 or fewer characters in the stack.