I'm trying to find the right approach to graphing a dataset that contains information on the amount of time users typically spend in various locations. Importantly, my data has categories and subcategories with increasing levels of granularity (for example, 60% of people are at "home", and of those, 40% are in the "living room"). I am aware of TreeMaps, which would display the information and relationships I need, but I have been asked to make a "network" visualization of the data.
What I am specifically looking for is a graphing approach in Python that would let me visualize my data with each node (or, better yet, each node label) automatically sized according to the number of users that fall within its category. Importantly, all the child node counts would also be included in the parent node's count (so dendrograms aren't really an option, because I need to display information at every branching point).
My data looks somewhat like this (note that some locations get more granular than others):
| ID  | BUILDING | subcat01   | subcat02 |
|-----|----------|------------|----------|
| 00  | home     | kitchen    | fridge   |
| 01  | office   | desk       | NaN      |
| 02  | office   | reception  | NaN      |
| 03  | home     | bedroom    | bed      |
| 04  | home     | yard       | NaN      |
| 05  | home     | livingroom | couch    |
| 06  | office   | conf_room  | NaN      |
| 07  | outdoors | NaN        | NaN      |
| ... | ...      | ...        | ...      |
For a very rough approximation of what I want to produce, see the image below. The important thing is that I'm able to size the nodes according to the sum of their children (or just themselves if it's an end node). I will be running lots of iterations with different filters, so I need something that I can easily rerun rather than manually coding the appearance of each graph.
Any suggestions on which Python libraries might best accomplish this? I've briefly looked into networkX, graph-tool, and etetoolkit, but I'm not sure if any of them have exactly the functionality I'm seeking.
Here's a rough approximation of what I want to produce:
To generate the graph, you could set the rows as paths of a directed graph. A simple way could be by defining a pandas dataframe and stacking to remove the missing values:
import networkx as nx
from networkx.drawing.nx_agraph import graphviz_layout
from matplotlib import rcParams
import pandas as pd
#df = pd.read_csv....
paths = df.loc[:,'BUILDING':].stack().groupby(level=0).agg(list).values.tolist()
# [['home', 'kitchen', 'fridge'], ['office', 'desk'], ['office', 'reception'],...
Note that stack is useful here since it drops the NaNs; we can then just groupby on the index and aggregate as lists. Then create a directed graph and add the paths with nx.add_path:
G = nx.DiGraph()
for path in paths:
    nx.add_path(G, path)
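To make the steps above reproducible without a CSV file, here is a minimal sketch that rebuilds the eight sample rows from the question by hand and constructs the same graph. It is only an illustration: the DataFrame construction is an assumption about how the data is loaded, and the paths are built row by row with dropna(), a swapped-in alternative to the stack/groupby one-liner that drops the missing levels explicitly.

```python
import networkx as nx
import pandas as pd

# Sample rows copied from the question's table; None marks a missing level.
df = pd.DataFrame({
    "ID": ["00", "01", "02", "03", "04", "05", "06", "07"],
    "BUILDING": ["home", "office", "office", "home",
                 "home", "home", "office", "outdoors"],
    "subcat01": ["kitchen", "desk", "reception", "bedroom",
                 "yard", "livingroom", "conf_room", None],
    "subcat02": ["fridge", None, None, "bed", None, "couch", None, None],
})

# Row-wise alternative to stack(): keep only the levels that are present.
paths = [row.dropna().tolist() for _, row in df.loc[:, "BUILDING":].iterrows()]

# Each row becomes one root-to-leaf path in the directed graph.
G = nx.DiGraph()
for path in paths:
    nx.add_path(G, path)

print(paths[0])                       # ['home', 'kitchen', 'fridge']
print(sorted(G.successors("home")))   # ['bedroom', 'kitchen', 'livingroom', 'yard']
```

Shared prefixes such as home are merged automatically, because add_path reuses nodes that already exist in the graph.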
Now to visualize the graph as a tree-like layout, we could use graphviz_layout, which is basically a wrapper for pygraphviz_layout:
rcParams['figure.figsize'] = 14, 10
pos = graphviz_layout(G, prog='dot')
nx.draw(G, pos=pos,
        node_color='lightgreen',
        node_size=1500,
        with_labels=True,
        arrows=True)
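The draw call above uses a fixed node_size, whereas the question asks for nodes sized by how many rows fall under them. One possible way to get there, sketched here under the assumption that every label is unique across branches (a "desk" under both home and office would collapse into a single node), is to count how many paths pass through each node. The base_size scale factor is an arbitrary value chosen for illustration.

```python
from collections import Counter

import networkx as nx

# The eight sample rows from the question, already reduced to paths.
paths = [
    ["home", "kitchen", "fridge"],
    ["office", "desk"],
    ["office", "reception"],
    ["home", "bedroom", "bed"],
    ["home", "yard"],
    ["home", "livingroom", "couch"],
    ["office", "conf_room"],
    ["outdoors"],
]

G = nx.DiGraph()
for path in paths:
    nx.add_path(G, path)

# Every row counts once for each node on its path, so a parent's count
# automatically includes the rows of all its descendants.
counts = Counter(node for path in paths for node in path)

# Scale counts to drawing sizes, in the same order as G.nodes().
base_size = 600  # arbitrary scale, an assumption for this sketch
node_sizes = [base_size * counts[n] for n in G.nodes()]

print(counts["home"], counts["office"], counts["fridge"])  # 4 3 1
```

The resulting list can then be passed as node_size=node_sizes in place of the constant 1500, since nx.draw accepts one size per node in node order.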
If you wanted to add a common source node for all buildings, you could insert a column named ALL right after ID:
df.insert(1, 'ALL', 'ALL')
paths = df.loc[:,'ALL':].stack().groupby(level=0).agg(list).values.tolist()
And then just do as above, where you'd now get:
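As a quick sanity check of the common-root idea, the following sketch uses three of the sample rows (a hand-built DataFrame, which is an assumption for illustration) and confirms that ALL ends up as the only node without a parent. The paths are again built row-wise with dropna() rather than with stack.

```python
import networkx as nx
import pandas as pd

# Three sample rows from the question; None marks a missing level.
df = pd.DataFrame({
    "ID": ["01", "04", "07"],
    "BUILDING": ["office", "home", "outdoors"],
    "subcat01": ["desk", "yard", None],
    "subcat02": [None, None, None],
})

# Insert the constant ALL column right after ID, as described above.
df.insert(1, "ALL", "ALL")
paths = [row.dropna().tolist() for _, row in df.loc[:, "ALL":].iterrows()]

G = nx.DiGraph()
for path in paths:
    nx.add_path(G, path)

# A root is a node with no incoming edges.
roots = [n for n, deg in G.in_degree() if deg == 0]
print(roots)  # ['ALL']
```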
Note that there are several other graphviz layout programs that may more closely resemble what you have in mind, for instance circo:
pos = graphviz_layout(G, prog='circo')
nx.draw(G, pos=pos,
        node_color='lightgreen',
        node_size=1500,
        with_labels=True,
        arrows=True)