Visualize Nodes and Their Connections in Clusters via networkx


I have a list of connections between pairs of nodes, describing similarities between entries in a dataset.

I'm thinking of visualizing the entries and their connections to show that there are clusters of very similar entries.

Each tuple stands for a pair of very similar nodes. I've set the weight to 1 for all of them since a weight is required, but I want all edges to be equally thick.

I've started with networkx; the problem is that I don't really know how to cluster the similar nodes together in a useful manner.

I have a list of the connections in a DataFrame:

smallSample = 
[[0, 1492, 1],
 [12, 937, 1],
 [16, 989, 1],
 [18, 371, 1],
 [18, 1140, 1],
 [26, 398, 1],
 [26, 1061, 1],
 [30, 1823, 1],
 [33, 1637, 1],
 [54, 1047, 1],
 [63, 565, 1]]

I create a graph the following way:

import networkx as nx
import matplotlib.pyplot as plt
G = nx.Graph()
for index, row in CC.iterrows():
    G.add_edge(row['source'], row['target'], weight=1)
pos = nx.spring_layout(G, seed=7)
nx.draw_networkx_nodes(G, pos, node_size=5)
nx.draw_networkx_edges(G, pos, edgelist=G.edges(), width=0.5)
pos = nx.spring_layout(G, k=1, iterations=200)
plt.figure(3, figsize=(2000,2000), dpi =2) 

With the small sample provided above the result looks like this:

[Image: Small Sample]

The result from my real df which consists of thousands of points:

[Image: Big Sample]

How can I group the linked nodes together so that it is easier to see how many of them are in each cluster? I don't want them to overlap so heavily; it's really hard to grasp how many there are, especially in the big sample.


Solution

  • From an InfoVis perspective there are a few things you can do:

    • Transparency & node size
      Transparency can be used to visualize overlap. You have to choose between two trade-offs: a lower alpha (more transparent nodes) lets you distinguish more overlapping layers, but many nodes then need to overlap before a cluster becomes visible, so you should increase the node size. A larger node size, however, makes individual nodes stand out less, and drawing the edges adds clutter (disable them or use thinner edges).
      TL;DR: Play with the trade-off between smaller nodes with higher alpha values and larger nodes with lower alpha values.
    • Play with the k parameter of nx.spring_layout: the larger it is, the further apart the nodes are placed. The default is 1/sqrt(len(G)); a slight increase to [1.2-1.7]/sqrt(len(G)) can give you some more clarity.
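To compare several k values side by side, the layouts can be computed in one go. This is only a minimal sketch: the helper name `layouts_for_k_factors` and the tiny graph `G_demo` are made up for illustration, and the factors are simply taken from the 1.2-1.7 range suggested above.

```python
from math import sqrt

import networkx as nx


def layouts_for_k_factors(G, factors=(1.0, 1.2, 1.5, 1.7), seed=7, iterations=100):
    """Return {factor: pos}, where pos is a spring layout with k = factor / sqrt(n)."""
    base_k = 1 / sqrt(len(G))  # the documented default k of nx.spring_layout
    return {
        f: nx.spring_layout(G, k=f * base_k, seed=seed, iterations=iterations)
        for f in factors
    }


# Hypothetical toy graph: two triangles and one isolated pair
G_demo = nx.Graph([(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (6, 7)])
layouts = layouts_for_k_factors(G_demo)
```

Each entry of `layouts` is an ordinary position dict, so it can be passed to nx.draw_networkx_nodes and nx.draw_networkx_edges in the same way as `pos` elsewhere in this answer.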

    Last but not least, I would suggest jitter, which shifts the node positions slightly at random and reduces overlap. (There are many papers on jitter and some better variants than the uniform jitter I chose here, but uniform jitter is the simplest to implement.)

    Recreating the dataset

    This code creates a similar-looking dataset:

    import random
    import numpy as np
    import pandas as pd
    from copy import deepcopy
    import networkx as nx
    import matplotlib.pyplot as plt
    from math import sqrt
    
    random.seed(7)
    np.random.seed(7)
    
    # Create a bigger dataset
    
    smallSample = [
     [0, 1492, 1],
     [12, 937, 1],
     [16, 989, 1],
     [18, 371, 1],
     [18, 1140, 1],
     [26, 398, 1],
     [26, 1061, 1],
     [30, 1823, 1],
     [33, 1637, 1],
     [54, 1047, 1],
     [63, 565, 1]]
    
    sample = deepcopy(smallSample)
    
    AMOUNT = 4000
    
    # Only the first two entries of each edge are node ids; the third is the weight
    present_nodes = list(set(x for edge in sample for x in edge[:2]))
    i = 2
    while i < AMOUNT:
        source = target = None
        while source == target:
            if random.random() < 0.9:
                # Create at least one new node
                source = i
                if random.random() < 0.7: # High value for many small clusters
                    # Create a second new node
                    target = i = i+1
                    present_nodes.append(target)
                else:
                    target = random.choice(present_nodes)
                present_nodes.append(source)
            else: # Link existing ones
                source = random.choice(present_nodes)
                target = random.choice(present_nodes)
        i += 1
        sample.append([source, target, 1])
    
    CC = pd.DataFrame(sample, columns=["source", "target", "weight"], dtype=int)
    
    # Create the Graph
    G = nx.Graph()
    for index, row in CC.iterrows():
        G.add_edge(row['source'], row['target'], weight=1)
    

    Calculate Positions

    # Default k = 1/sqrt(len(G))
    pos = nx.spring_layout(G, k=1/sqrt(len(G)), seed=7, iterations=100)
    # cast the pos dict to an np.array of shape (n_nodes, 2)
    apos = np.array(list(pos.values()))
    

    Default Look

    [Image: default layout]

    Transparency

    plt.figure(figsize=(10, 10))  # create the figure before drawing; the size is an arbitrary choice
    nx.draw_networkx_nodes(G, pos, node_size=10, alpha=0.45, linewidths=0.2)
    nx.draw_networkx_edges(G, pos, edgelist=G.edges(), width=0.5, alpha=0.2)
    plt.title("Transparency")
    plt.show()
    

    [Image: Transparency]

    Use a larger k value

    This increases the distances between the nodes and makes the layout less clumpy:

    pos15 = nx.spring_layout(G, k=1.5/sqrt(len(G)), seed=7, iterations=100) # Larger k to make it less clumpy
    
    # cast the pos dict to an np.array of shape (n_nodes, 2)
    apos15 = np.array(list(pos15.values()))
    
    plt.figure(figsize=(10, 10))  # create the figure before drawing; the size is an arbitrary choice
    nx.draw_networkx_nodes(G, pos15, node_size=10, alpha=0.55, linewidths=0.2)
    nx.draw_networkx_edges(G, pos15, edgelist=G.edges(), width=0.5, alpha=0.2)
    plt.title("Larger k")
    plt.show()
    

    [Image: Larger k]

    Adding Jitter

    JITTER = 0.025
    jitter = np.random.uniform(low=-JITTER, high=JITTER, size=apos.shape)
    jpos = {k:p for k,p in zip(pos.keys(), apos + jitter)}
    jpos15 = {k:p for k,p in zip(pos15.keys(), apos15 + jitter)}
    
    plt.figure(figsize=(10, 10))  # create the figure before drawing; the size is an arbitrary choice
    nx.draw_networkx_nodes(G, jpos, node_size=10, alpha=0.45, linewidths=0.2)
    nx.draw_networkx_edges(G, jpos, edgelist=G.edges(), width=0.5, alpha=0.2)
    plt.title("default + jitter")
    plt.show()
    
    plt.figure(figsize=(10, 10))
    nx.draw_networkx_nodes(G, jpos15, node_size=10, alpha=0.55, linewidths=0.2)  # As nodes overlap less, I would increase the alpha level a bit
    nx.draw_networkx_edges(G, jpos15, edgelist=G.edges(), width=0.5, alpha=0.2)
    plt.title("larger k + jitter")
    plt.show()
    

    [Images: default + jitter, larger k + jitter]
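The jitter step can also be wrapped in a small helper so that different noise distributions are easy to try. The sketch below is illustrative only: the function name `jitter_positions`, its parameters, and the Gaussian option are assumptions of this example, and Gaussian noise is shown as one possible alternative to uniform jitter, not as a recommendation.

```python
import numpy as np


def jitter_positions(pos, scale=0.025, kind="uniform", seed=7):
    """Return a new networkx-style pos dict with small random offsets added."""
    rng = np.random.default_rng(seed)
    nodes = list(pos)
    coords = np.array([pos[n] for n in nodes], dtype=float)
    if kind == "uniform":
        # Every offset lies within [-scale, scale]
        noise = rng.uniform(-scale, scale, size=coords.shape)
    elif kind == "normal":
        # Gaussian offsets with standard deviation `scale` (unbounded)
        noise = rng.normal(0.0, scale, size=coords.shape)
    else:
        raise ValueError(f"unknown jitter kind: {kind!r}")
    return dict(zip(nodes, coords + noise))


# Hypothetical positions: "a" and "b" overlap exactly before jittering
pos_demo = {"a": (0.0, 0.0), "b": (0.0, 0.0), "c": (1.0, 1.0)}
jittered = jitter_positions(pos_demo, scale=0.025, kind="uniform")
```

The returned dict has the same keys as the input, so it can be used wherever a layout dict such as `pos` is expected.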


    In the end, it comes down to playing around with the parameters until you find something you like.
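As a complement to the plots, the number of nodes per cluster can also be counted directly. The sketch below assumes that a "cluster" means a connected component of the graph, which is an interpretation rather than something stated in the question; the variable names are illustrative, and the edge list is the small sample from the question.

```python
from collections import Counter

import networkx as nx

small_sample = [
    [0, 1492, 1], [12, 937, 1], [16, 989, 1], [18, 371, 1],
    [18, 1140, 1], [26, 398, 1], [26, 1061, 1], [30, 1823, 1],
    [33, 1637, 1], [54, 1047, 1], [63, 565, 1],
]

G_small = nx.Graph()
# Use only source and target; the weight is the same for every edge
G_small.add_edges_from((s, t) for s, t, _w in small_sample)

# One entry per connected component, largest first
sizes = sorted((len(c) for c in nx.connected_components(G_small)), reverse=True)
# Maps cluster size -> number of clusters of that size
size_histogram = Counter(sizes)
```

For the big graph, `size_histogram` gives a compact summary of how many clusters of each size exist, which can be hard to read off a dense plot.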