Search code examples
calgorithmoptimizationcharts

Graph Paths Abstraction Algorithm Needed


I have a data structure holding a graph like the one in the following picture: Tree Data Structure

In this tree, a node can have any number of unique children from the levels below it. In tree in the picture represents a set of paths. Where every path should begin with a node from Level 1, and ends with a node of "*" mark. So the paths of the tree in the picture are:

A then C then G

A then C then G then J

A then D then G

A then D then G the J

A then D then K, and so on...

Actually my original tree is huge (around 2 Million sequences) and the maximum number of nodes per level is 61 (of 11 levels). So it causes many memory consumption problems in my application (a computer vision application for SAMSUNG).

My target is to have an iterative algorithm that represents these paths in a more compact string format. So I think we the problem is divided into three steps as follows. I have built the tree data structure (step 2), but still can not derive an iterative algorithm that gets the output string/sequence in step 3 from the tree.

1- Input String:

(A C G) | (A C G J) | (A D G) | (A D G J ) | (A D K) | ....,

Where "|" represents alternatives.

2- Building Tree Data Structure of These Paths.

3- Required Output String:

(A (C G [J]) | (D (G [J]) | K)) | (B ....).

Where where "|" represents alternatives and "[ ]" encloses options. The target output string should be optimized like there are not more common factors that can be taken to more simplify it.


Solution

  • You can use a modification of iterative DFS, which utilizes a stack to keep track of unprocessed nodes. This algorithm never stores more than 6 characters on the stack* for any one node, and there are always fewer than N nodes on the stack (where N is the number of nodes in the graph). You've indicated that N will be at most 61*11=671, so there will be a maximum of about 4000 elements possible on the stack.

    In the pseudocode below, a "destination" node is a starred node in the example above, e.g. G*.

    Initialization:

    A dummy node Φ is introduced with an edge from Φ to each of the "root" nodes, e.g. nodes A and B above. The token for Φ is assumed to be a non-printing character, or you can explicitly check before adding it to the output string. The node Φ is pushed onto the stack before calling the function.

    outString := ""
    while stack not empty
       pop token
       if token is node
          outString := outString + node(token)  // Line 5 - explanation below
          if node(token) has children
             if node(token) is destination
                outString := outString + "["
                push "]"
             end
             if node(token) has multiple children
                for each child of node(token), from right to left
                   push ")"
                   push child
                   push "("
                   push "|"
                end
                pop // remove last "|"
             else
                push child
             end
          end
    
       else // token is ()[]|
          outString := outString + token
       end
    end
    

    The output of this algorithm for the first part of your graph (A and its children) is (with extra spaces added for clarity; the spaces can be easily added to the code):

    A (C G [J]) | (D (G [J]) | (K))
    

    You'll notice a deviation between your result and mine: the final node K is enclosed in parentheses in my solution. If this is undesirable (it could result in ugliness like A[(B)|(C)]), you can eliminate it by performing an additional check when you pop a node token off of the stack at the cost of some additional overhead. Simply replace Line 5 above with:

    if (node(token) has no children
        AND last character of outString is "("
        AND next token on stack is ")")
          remove trailing "(" from outString
          concatenate token to outString
          pop ")" from stack and ignore
    else
       outString := outString + node(token) // as above
    end
    

    Let me know if you have any questions or I've missed anything.

    * This will happen in the (probably highly unlikely) case of a node being written as |[(A)]. Most nodes will take up 4 or fewer characters in the stack.