Tags: python, networkx, graphviz, pydot, altair

How to visualize word patterns?


I have a data structure that looks like this:

<client>: {
    <document>: [
        {'start': <datetime>,
         'end': <datetime>,
         'group': <string>}
    ]
 }

The list of dictionaries within a <document> is sorted by the 'start' date, and a new entry cannot start before the one before it ends. I iterate over this data structure and I collect the values of group as time progresses into a new structure, e.g.:

<client>: {
    <document>: {'progression': <group_1>|<group_2>|...|<group_n>}
 }

where <group_1> corresponds to the value of 'group' for the first dictionary in <document>, and so on. I want to visualize this progression of groups for all documents, so for example I know that I have 5,000 entries starting with "abc" (before the first pipe); out of those, 2,000 are followed by "def", so "abc"|"def". Of those, 500 revert back to "abc": "abc"|"def"|"abc" and the remaining 1,500 are followed by "ghi": "abc"|"def"|"ghi". The remaining 3,000 entries starting with "abc" follow some different progression pattern.
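The collection and counting steps described above can be sketched as follows. This is only an illustration under assumptions: the sample `clients` data, the client and document keys, and the helper names `build_progressions` and `count_prefixes` are all hypothetical, and prefixes are counted as tuples rather than with whatever combination of Counter structures is actually in use.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample data in the shape described above.
clients = {
    "client_a": {
        "doc_1": [
            {"start": datetime(2020, 1, 1), "end": datetime(2020, 2, 1), "group": "abc"},
            {"start": datetime(2020, 2, 1), "end": datetime(2020, 3, 1), "group": "def"},
            {"start": datetime(2020, 3, 1), "end": datetime(2020, 4, 1), "group": "ghi"},
        ],
        "doc_2": [
            {"start": datetime(2020, 1, 5), "end": datetime(2020, 2, 5), "group": "abc"},
            {"start": datetime(2020, 2, 5), "end": datetime(2020, 3, 5), "group": "def"},
            {"start": datetime(2020, 3, 5), "end": datetime(2020, 4, 5), "group": "abc"},
        ],
    },
    "client_b": {
        "doc_3": [
            {"start": datetime(2020, 1, 1), "end": datetime(2020, 2, 1), "group": "abc"},
            {"start": datetime(2020, 2, 1), "end": datetime(2020, 3, 1), "group": "xyz"},
        ],
    },
}


def build_progressions(clients):
    """Join each document's groups, in list order, with '|'."""
    return {
        client: {
            doc: {"progression": "|".join(entry["group"] for entry in entries)}
            for doc, entries in docs.items()
        }
        for client, docs in clients.items()
    }


def count_prefixes(progressions):
    """Count how many documents share each leading sequence of groups."""
    counts = Counter()
    for docs in progressions.values():
        for doc in docs.values():
            groups = doc["progression"].split("|")
            for depth in range(1, len(groups) + 1):
                counts[tuple(groups[:depth])] += 1
    return counts


progressions = build_progressions(clients)
prefix_counts = count_prefixes(progressions)
```

With this sample, `prefix_counts[("abc",)]` covers all three documents, and `prefix_counts[("abc", "def")]` covers the two that continue with "def", which mirrors the 5,000 / 2,000 breakdown in the example.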

What I am trying to do is visualize this progression via something looking like a Sankey diagram, or another appropriate tree-like structure, in which the top node would be "abc", then there would be a "thick" branch to the left corresponding to the different progression pattern, and a "thinner" branch to the right corresponding to the 2,000 "abc" cases followed by "def". Then "def" would be another node with similar branches, one leading to a new "abc" (for the "abc"|"def"|"abc" case) and one leading to "ghi" (for the "abc"|"def"|"ghi" case), preferably annotated with the count in each node as the "tree" thins down. I use a combination of Python Counter structures to retrieve the numbers for each potential progression, but I do not know how I can create a visualization programmatically.

My understanding is that it is probably a problem that can be addressed using dot language, and packages like pydot and/or pygraphviz, but I am not sure whether I am on the right track.
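For the dot-language route, one possible sketch is to emit DOT source as plain text from per-prefix counts, without depending on any particular Python package. The `prefix_counts` values and the function name `prefix_tree_to_dot` below are hypothetical, and the sketch assumes group names contain no double quotes.

```python
from collections import Counter

# Hypothetical prefix counts: each key is a leading sequence of groups.
prefix_counts = Counter({
    ("abc",): 5,
    ("abc", "def"): 2,
    ("abc", "def", "abc"): 1,
    ("abc", "def", "ghi"): 1,
})


def prefix_tree_to_dot(prefix_counts):
    """Render the prefix counts as DOT source for a top-down tree."""
    lines = ["digraph progression {", "    rankdir=TB;"]
    for prefix in sorted(prefix_counts):
        # The full prefix is the node id, so a repeated group name
        # at a different depth becomes a separate node.
        node_id = "|".join(prefix)
        # "\\n" puts a literal backslash-n in the DOT label (a line break).
        label = f"{prefix[-1]}\\n{prefix_counts[prefix]}"
        lines.append(f'    "{node_id}" [label="{label}"];')
        if len(prefix) > 1:
            parent_id = "|".join(prefix[:-1])
            lines.append(f'    "{parent_id}" -> "{node_id}";')
    lines.append("}")
    return "\n".join(lines)


dot_source = prefix_tree_to_dot(prefix_counts)
print(dot_source)
```

The resulting text can be saved to a file and rendered with the Graphviz `dot` command, for example `dot -Tpng tree.dot -o tree.png`. Each node is labelled with its group name and count; branch thickness is not encoded here, which is something a Sankey diagram handles more naturally.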


Solution

  • I think Sankey diagrams are the best choice in your case. Suppose you have a data structure that stores the group sequence of each document, taken from 'progression': <group_1>|<group_2>|...|<group_n>, as one list per document. Then you can construct a Sankey diagram like this:

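    import itertools
    from collections import defaultdict

    from plotly.offline import iplot

    # Each inner list is the sequence of groups for one document
    data = [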
        [1,2,3,1,4],
        [1,4,2],
        [1,2,5,3,5],
        [1,3],
        [1,4,5,1,4,3],
        [1,5,4,3],
        [1,2,5,1,3,4],
        [1,5],
        [1,2,1,1,5,2],
        [1,5,4,3],
        [1,1,2,3,4,1]
    ]
    
    # Append _0, _1, ... position suffixes to distinguish paths like 1-2-2-1 and 1-2-1-2
    nodes = sorted(set(itertools.chain.from_iterable(
        (f'{e}_{i}' for i, e in enumerate(line)) for line in data
    )))
    # Count how often each (source, target) transition occurs between adjacent positions
    countered = defaultdict(int)
    for line in data:
        for i in range(len(line) - 1):
            countered[(f'{line[i]}_{i}', f'{line[i+1]}_{i+1}')] += 1

    # plotly expects links as parallel lists of node indices and values
    node_index = {name: idx for idx, name in enumerate(nodes)}
    links = {
        'source': [node_index[src] for src, dst in countered],
        'target': [node_index[dst] for src, dst in countered],
        'value': list(countered.values())
    }
    
    data_trace = dict(
        type='sankey',
        domain = dict(
          x =  [0,1],
          y =  [0,1]
        ),
        orientation = "h",
        valueformat = ".0f",
        node = dict(
          pad = 10,
          thickness = 30,
          line = dict(
            color = "black",
            width = 0
          ),
          label =  nodes
        ),
        link = links
    )
    
    layout =  dict(
        title = "___",
        height = 772,
        font = dict(
          size = 10
        ),    
    )
    
    fig = dict(data=[data_trace], layout=layout)
    # iplot renders the figure inline in a Jupyter notebook
    iplot(fig, validate=True)
    

    This draws a Sankey plot of the progressions:

    [Figure: the resulting Sankey diagram]

    You can find more information about how Sankey diagrams work in the plotly documentation.
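To feed the pipe-delimited 'progression' strings from the question into the same kind of trace, the node labels and links could be derived along these lines. This is a sketch under assumptions: the sample `progressions` list and the function name `sankey_inputs` are hypothetical, and nodes are keyed by (position, group) tuples instead of the `_0`, `_1` string suffixes, so the labels show plain group names.

```python
from collections import Counter

# Hypothetical progression strings, one per document.
progressions = ["abc|def|abc", "abc|def|ghi", "abc|def|ghi", "abc|xyz"]


def sankey_inputs(progressions):
    """Return (labels, link) in the shape used by a 'sankey' trace."""
    transitions = Counter()
    for progression in progressions:
        groups = progression.split("|")
        # Tag each group with its position so that repeats stay distinct.
        steps = list(enumerate(groups))
        # Count every adjacent (source, target) pair.
        transitions.update(zip(steps, steps[1:]))

    nodes = sorted({step for pair in transitions for step in pair})
    index = {step: i for i, step in enumerate(nodes)}
    labels = [group for _, group in nodes]
    link = {
        "source": [index[src] for src, _ in transitions],
        "target": [index[dst] for _, dst in transitions],
        "value": list(transitions.values()),
    }
    return labels, link


labels, link = sankey_inputs(progressions)
```

Here `labels` would take the place of `nodes` in the `node` dict and `link` the place of `links`. Documents with a single group contribute no transitions, so they do not appear in the diagram at all.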