Search code examples
bashawkgraphvizdot

AWK recursive tree structure


I'm trying to parse a file that contains lines in a hierarchical structure. For example the file:

a b c
a b d
a B C
A B C

indicates that a contains b and B, that b contains c and d, that B contains C. A contains a different B which contains its own C.

This is much like a list of files.

I want to format this in a hierarchical bracketed way like:

a {
    b {
        c
        d
    }
    B {
        C
    }
}
A {
    B {
        C
    }
}

I couldn't come up with a decent way to do this. I thought that AWK would be my best bet, but came up short with how to actually implement it.

Context

My input is actually a list of files. I can of course separate the fields by spaces if needed, or keep them with /. The files are unordered and generated from a code-base during compile-time via inspection. My desired output is going to be a graphviz DOT file containing each file in its own subgraph.

Thus for the input:

a/b/c
a/b/d
a/B/C
A/B/C

the output would be

digraph {
  subgraph cluster_a {
    label = a
    subgraph cluster_b {
        label = b
        node_1 [label=c]
        node_2 [label=d]
    }
    subgraph cluster_B {
        label = B
        node_3 [label=C]
    }
  }
  subgraph cluster_A {
      label = A
      subgraph cluster_B {
          label = B
          node_4 [label=C]
      }
  }
}

graphviz output

Does anybody know how I could get this processing done? I'm open to other tools as well, not just AWK.

NOTE: Depth is not fixed, though I could pre-compute the maximum depth if necessary. Not all leaves will be at the same depth either.


Solution

  • I'm open to other tools as well, not just AWK.

    I offer this Python solution:

    import sys
    
    INDENT = '  '
    NODE_COUNT = 1
    
    
    def build(node, l):
        x = l[0]
        if x not in node:
            node[x] = {}
    
        if len(l) > 1:
            build(node[x], l[1:])
    
    
    def indent(s, depth):
        print('%s%s' % (INDENT * depth, s))
    
    
    def print_node(label, value, depth):
    
        if len(value.keys()) > 0:
            indent('subgraph cluster_%s {' % label, depth)
            indent('  label = %s' % label, depth)
            for child in value:
                print_node(child, value[child], depth+1)
            indent('}', depth)
        else:
            global NODE_COUNT
            indent('node_%d [label=%s]' % (NODE_COUNT, label), depth)
            NODE_COUNT += 1
    
    
    def main():
    
        d = {}
    
        for line in sys.stdin:
            build(d, [x.strip() for x in line.split()])
    
        print('digraph {')
        for k in d.keys():
            print_node(k, d[k], 1)
        print('}')
    
    
    if __name__ == '__main__':
        main()
    

    Result:

    $ cat rels.txt
    a b c
    a b d
    a B C
    A B C
    
    $ cat rels.txt | python3 make_rels.py
    digraph {
      subgraph cluster_a {
        label = a
        subgraph cluster_b {
          label = b
          node_1 [label=c]
          node_2 [label=d]
        }
        subgraph cluster_B {
          label = B
          node_3 [label=C]
        }
      }
      subgraph cluster_A {
        label = A
        subgraph cluster_B {
          label = B
          node_4 [label=C]
        }
      }
    }