python pandas matplotlib charts networkx

Visualize Nodes and Their Connections in Clusters via networkx

I have a list of connections between pairs of nodes, describing similarities between entries in a dataset.

I'm thinking of visualizing the entries and their connections to show that there are clusters of very similar entries.

Each tuple stands for a pair of very similar nodes. I set the weight to 1 for all of them because a weight is required, but I want all edges to be equally thick.

I've started with networkx; the problem is that I don't really know how to cluster the similar nodes together in a useful manner.

I have a list of the connections in a DataFrame:

smallSample = 
[[0, 1492, 1],
 [12, 937, 1],
 [16, 989, 1],
 [18, 371, 1],
 [18, 1140, 1],
 [26, 398, 1],
 [26, 1061, 1],
 [30, 1823, 1],
 [33, 1637, 1],
 [54, 1047, 1],
 [63, 565, 1]]

I create the graph the following way:

import networkx as nx
import matplotlib.pyplot as plt
G = nx.Graph()
for index, row in CC.iterrows():
      G.add_edge(CC['source'].loc[index],CC['target'].loc[index], weight =1)
pos = nx.spring_layout(G, seed=7)
nx.draw_networkx_nodes(G, pos, node_size=5)
nx.draw_networkx_edges(G, pos, edgelist=G.edges(), width=0.5)
pos = nx.spring_layout(G, k=1, iterations=200)
plt.figure(3, figsize=(2000,2000), dpi =2)

With the small sample provided above, the result looks like this:

[Image: plot of the small sample]

The result from my real df, which consists of thousands of points:

[Image: Big Sample]

How can I group the linked nodes together so that it is easier to see how many of them are in each cluster? I don't want them to overlap so heavily; it is hard to grasp how many of them there are, especially in the big sample.

Solution

From an InfoVis perspective, there are a few things you can do:

Transparency & node size
Transparency can be used to visualize overlap. There is a tradeoff between two options: a lower alpha value (more transparent nodes) lets you see more overlapping layers, but it only pays off when many nodes overlap, so you should increase the node size. However, a larger node size makes individual nodes stand out less, and drawing the edges adds clutter (disable them or use thinner edges).
TL;DR: Experiment with smaller node sizes and higher alpha values vs. larger node sizes and lower alpha values.
Play with the k parameter of nx.spring_layout: the larger it is, the further apart the nodes are placed. The default is 1/sqrt(len(G)); a slight increase to [1.2-1.7]/sqrt(len(G)) can give you some more clarity.

Last but not least, I would suggest jitter, which shifts the node positions slightly at random and lessens overlap (there are many papers on jitter and better variants than the uniform jitter I chose here; however, uniform jitter is the simplest to implement).
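
The rule of thumb for k can be written as a small helper. This is only a minimal sketch: the name scaled_k is hypothetical, and the tiny graph built from a few of the sample pairs is a stand-in for the real data.

```python
from math import sqrt

import networkx as nx


def scaled_k(G, factor=1.0):
    """Return factor / sqrt(n), i.e. the spring_layout default k scaled by `factor`."""
    return factor / sqrt(len(G))


# Stand-in graph (assumption): a handful of the similarity pairs from the question
G = nx.Graph()
G.add_edges_from([(0, 1492), (12, 937), (18, 371), (18, 1140), (26, 398), (26, 1061)])

# One layout per factor; the same seed keeps the runs comparable
layouts = {
    factor: nx.spring_layout(G, k=scaled_k(G, factor), seed=7, iterations=100)
    for factor in (1.0, 1.2, 1.5, 1.7)
}
```

Each entry of layouts is an ordinary position dict that can be passed to nx.draw_networkx_nodes, so the factors can be compared visually.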

Some recreation of the dataset

This code creates a similar looking dataset

import random
import numpy as np
import pandas as pd
from copy import deepcopy
import networkx as nx
import matplotlib.pyplot as plt
from math import sqrt

random.seed(7)
np.random.seed(7)

# Create a bigger dataset

smallSample = [
 [0, 1492, 1],
 [12, 937, 1],
 [16, 989, 1],
 [18, 371, 1],
 [18, 1140, 1],
 [26, 398, 1],
 [26, 1061, 1],
 [30, 1823, 1],
 [33, 1637, 1],
 [54, 1047, 1],
 [63, 565, 1]]

sample = deepcopy(smallSample)

AMOUNT = 4000

# Only the first two entries of an edge are node ids; the third is the weight
present_nodes = list(set(x for edge in sample for x in edge[:2]))
i = 2
while i < AMOUNT:
    source = target = None
    while source == target:
        if random.random() < 0.9:
            # Create at least one new node
            source = i
            if random.random() < 0.7: # High value for many small clusters
                # Create a second new node
                target = i = i+1
                present_nodes.append(target)
            else:
                target = random.choice(present_nodes)
            present_nodes.append(source)
        else: # Link existing ones
            source = random.choice(present_nodes)
            target = random.choice(present_nodes)
    i += 1
    sample.append([source, target, 1])

CC = pd.DataFrame(sample, columns=["source", "target", "weight"], dtype=int)

# Create the Graph
G = nx.Graph()
for _, row in CC.iterrows():
    G.add_edge(row["source"], row["target"], weight=1)

Calculate Positions

# Default k = 1/sqrt(len(G))
pos = nx.spring_layout(G, k=1/sqrt(len(G)), seed=7, iterations=100)
# Cast the pos dict to an (n_nodes, 2) np.array (also works on older NumPy versions)
apos = np.array(list(pos.values()))

Default Look

Transparency

# Create the figure first; calling plt.figure() after drawing would open a new, empty figure
plt.figure(figsize=(10, 10))
nx.draw_networkx_nodes(G, pos, node_size=10, alpha=0.45, linewidths=0.2)
nx.draw_networkx_edges(G, pos, edgelist=G.edges(), width=0.5, alpha=0.2)
plt.title("Transparency")
plt.show()
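
To see the node size vs. alpha tradeoff directly, the two extremes can be drawn side by side. The following is a rough sketch under stated assumptions: the random graph is a stand-in for the real data, and the concrete node_size and alpha values are illustrative guesses, not recommendations.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); drop this line in a notebook
import matplotlib.pyplot as plt
import networkx as nx

# Stand-in graph (assumption): sparse random graph with many small components
G = nx.gnm_random_graph(200, 150, seed=7)
pos = nx.spring_layout(G, seed=7)

# Illustrative values only: the two ends of the tradeoff
variants = [
    ("small nodes, high alpha", dict(node_size=5, alpha=0.8)),
    ("large nodes, low alpha", dict(node_size=40, alpha=0.25)),
]

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
for ax, (title, node_kwargs) in zip(axes, variants):
    nx.draw_networkx_nodes(G, pos, ax=ax, linewidths=0.2, **node_kwargs)
    nx.draw_networkx_edges(G, pos, ax=ax, width=0.5, alpha=0.2)
    ax.set_title(title)
    ax.set_axis_off()
```

Call plt.show() or fig.savefig(...) afterwards to look at the result.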

Use a larger k value

This increases the distances between the nodes and makes it less clumpy

pos15 = nx.spring_layout(G, k=1.5/sqrt(len(G)), seed=7, iterations=100) # Larger k to make it less clumpy

# Cast the pos dict to an (n_nodes, 2) np.array (also works on older NumPy versions)
apos15 = np.array(list(pos15.values()))

# Create the figure first; calling plt.figure() after drawing would open a new, empty figure
plt.figure(figsize=(10, 10))
nx.draw_networkx_nodes(G, pos15, node_size=10, alpha=0.55, linewidths=0.2)
nx.draw_networkx_edges(G, pos15, edgelist=G.edges(), width=0.5, alpha=0.2)
plt.title("Larger k")
plt.show()

Adding Jitter

JITTER = 0.025
jitter = np.random.uniform(low=-JITTER, high=JITTER, size=apos.shape)
jpos = {k:p for k,p in zip(pos.keys(), apos + jitter)}
jpos15 = {k:p for k,p in zip(pos15.keys(), apos15 + jitter)}

# Create the figure first; calling plt.figure() after drawing would open a new, empty figure
plt.figure(figsize=(10, 10))
nx.draw_networkx_nodes(G, jpos, node_size=10, alpha=0.45, linewidths=0.2)
nx.draw_networkx_edges(G, jpos, edgelist=G.edges(), width=0.5, alpha=0.2)
plt.title("default + jitter")
plt.show()

# Create the figure first; calling plt.figure() after drawing would open a new, empty figure
plt.figure(figsize=(10, 10))
nx.draw_networkx_nodes(G, jpos15, node_size=10, alpha=0.55, linewidths=0.2)  # As nodes overlap less, I would increase the alpha level a bit
nx.draw_networkx_edges(G, jpos15, edgelist=G.edges(), width=0.5, alpha=0.2)
plt.title("larger k + jitter")
plt.show()
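
Uniform noise is not the only option for jitter. As one possible variant, here is a small sketch with Gaussian noise. The helper name gaussian_jitter and the scale value are assumptions for illustration, and whether it reads better than uniform jitter depends on the data.

```python
import networkx as nx
import numpy as np


def gaussian_jitter(pos, scale=0.01, seed=7):
    """Return a copy of a layout dict with normally distributed noise added to each position."""
    rng = np.random.default_rng(seed)
    return {
        node: np.asarray(xy, dtype=float) + rng.normal(0.0, scale, size=2)
        for node, xy in pos.items()
    }


# Stand-in graph (assumption): any layout dict from nx.spring_layout works the same way
G = nx.path_graph(6)
pos = nx.spring_layout(G, seed=7)
jpos = gaussian_jitter(pos, scale=0.01)
```

The returned dict can be passed to the drawing functions exactly like jpos and jpos15 above.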

In the end, it comes down to playing around with the parameters until you find something you like.
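
To make that experimenting easier, the combinations can be drawn in one grid. This is a sketch under stated assumptions: the random graph stands in for the real data, and the k factors and jitter amounts are simply the ones used in this answer.

```python
from math import sqrt

import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); drop this line in a notebook
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np

rng = np.random.default_rng(7)
G = nx.gnm_random_graph(300, 200, seed=7)  # stand-in graph (assumption)

k_factors = (1.0, 1.5)   # multiples of the default k = 1/sqrt(len(G))
jitters = (0.0, 0.025)   # maximum uniform displacement per coordinate

fig, axes = plt.subplots(len(jitters), len(k_factors), figsize=(10, 10))
for col, factor in enumerate(k_factors):
    pos = nx.spring_layout(G, k=factor / sqrt(len(G)), seed=7, iterations=100)
    for row, jitter in enumerate(jitters):
        # Uniform jitter in [-jitter, jitter]; a jitter of 0 leaves the layout unchanged
        jpos = {n: xy + jitter * rng.uniform(-1, 1, size=2) for n, xy in pos.items()}
        ax = axes[row, col]
        nx.draw_networkx_nodes(G, jpos, ax=ax, node_size=10, alpha=0.5, linewidths=0.2)
        nx.draw_networkx_edges(G, jpos, ax=ax, width=0.5, alpha=0.2)
        ax.set_title(f"k = {factor}/sqrt(n), jitter = {jitter}")
        ax.set_axis_off()
```

Call plt.show() or fig.savefig(...) afterwards and extend k_factors or jitters to try more combinations.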