Graph Paths Abstraction Algorithm Needed

I have a data structure holding a graph like the one in the following picture: Tree Data Structure

In this tree, a node can have any number of unique children from the levels below it. In tree in the picture represents a set of paths. Where every path should begin with a node from Level 1, and ends with a node of "*" mark. So the paths of the tree in the picture are:

A then C then G

A then C then G then J

A then D then G

A then D then G the J

A then D then K, and so on...

Actually my original tree is huge (around 2 Million sequences) and the maximum number of nodes per level is 61 (of 11 levels). So it causes many memory consumption problems in my application (a computer vision application for SAMSUNG).

My target is to have an iterative algorithm that represents these paths in a more compact string format. So I think we the problem is divided into three steps as follows. I have built the tree data structure (step 2), but still can not derive an iterative algorithm that gets the output string/sequence in step 3 from the tree.

1- Input String:

(A C G) | (A C G J) | (A D G) | (A D G J ) | (A D K) | ....,

Where "|" represents alternatives.

2- Building Tree Data Structure of These Paths.

3- Required Output String:

(A (C G [J]) | (D (G [J]) | K)) | (B ....).

Where where "|" represents alternatives and "[ ]" encloses options. The target output string should be optimized like there are not more common factors that can be taken to more simplify it.

Solution

You can use a modification of iterative DFS, which utilizes a stack to keep track of unprocessed nodes. This algorithm never stores more than 6 characters on the stack^* for any one node, and there are always fewer than N nodes on the stack (where N is the number of nodes in the graph). You've indicated that N will be at most 61*11=671, so there will be a maximum of about 4000 elements possible on the stack.

In the pseudocode below, a "destination" node is a starred node in the example above, e.g. G^*.

Initialization:

A dummy node Φ is introduced with an edge from Φ to each of the "root" nodes, e.g. nodes A and B above. The token for Φ is assumed to be a non-printing character, or you can explicitly check before adding it to the output string. The node Φ is pushed onto the stack before calling the function.

outString := ""
while stack not empty
   pop token
   if token is node
      outString := outString + node(token)  // Line 5 - explanation below
      if node(token) has children
         if node(token) is destination
            outString := outString + "["
            push "]"
         end
         if node(token) has multiple children
            for each child of node(token), from right to left
               push ")"
               push child
               push "("
               push "|"
            end
            pop // remove last "|"
         else
            push child
         end
      end

   else // token is ()[]|
      outString := outString + token
   end
end

The output of this algorithm for the first part of your graph (A and its children) is (with extra spaces added for clarity; the spaces can be easily added to the code):

A (C G [J]) | (D (G [J]) | (K))

You'll notice a deviation between your result and mine: the final node K is enclosed in parentheses in my solution. If this is undesirable (it could result in ugliness like A[(B)|(C)]), you can eliminate it by performing an additional check when you pop a node token off of the stack at the cost of some additional overhead. Simply replace Line 5 above with:

if (node(token) has no children
    AND last character of outString is "("
    AND next token on stack is ")")
      remove trailing "(" from outString
      concatenate token to outString
      pop ")" from stack and ignore
else
   outString := outString + node(token) // as above
end

Let me know if you have any questions or I've missed anything.

^* This will happen in the (probably highly unlikely) case of a node being written as |[(A)]. Most nodes will take up 4 or fewer characters in the stack.