I have a list of connections between pairs of nodes, describing similarities between entries in a dataset.
I'm thinking of visualising the entries and their connections to show that there are clusters of very similar entries.
Each tuple stands for a pair of very similar nodes. I've chosen a weight of 1 for all of them, since a weight is required, but I want all edges equally thick.
I've started with networkx; the problem is that I don't really know how to cluster the similar nodes together in a useful manner.
I have a list of the connections in a DataFrame:
smallSample =
[[0, 1492, 1],
[12, 937, 1],
[16, 989, 1],
[18, 371, 1],
[18, 1140, 1],
[26, 398, 1],
[26, 1061, 1],
[30, 1823, 1],
[33, 1637, 1],
[54, 1047, 1],
[63, 565, 1]]
I create the graph the following way:
import networkx as nx
import matplotlib.pyplot as plt
G = nx.Graph()
for index, row in CC.iterrows():
    G.add_edge(CC['source'].loc[index], CC['target'].loc[index], weight=1)
pos = nx.spring_layout(G, seed=7)
nx.draw_networkx_nodes(G, pos, node_size=5)
nx.draw_networkx_edges(G, pos, edgelist=G.edges(), width=0.5)
pos = nx.spring_layout(G, k=1, iterations=200)
plt.figure(3, figsize=(2000,2000), dpi =2)
With the small sample provided above the result looks like this:
The result from my real df which consists of thousands of points:
How can I group the linked nodes together so that it is easier to see how many of them are in each cluster? I don't want them to overlap so heavily; it's really not easy to grasp how many of them there are, especially in the big sample.
From an InfoVis perspective there are a few things you can do:
- Use transparency (the alpha argument of the drawing functions), so that overlapping nodes show up as darker spots.
- Increase the k parameter of nx.spring_layout: the larger it is, the further apart the nodes are placed. The default is 1/sqrt(len(G)); a slight increase to [1.2-1.7]/sqrt(len(G)) can give you some more clarity.
- Last but not least, I would suggest jitter, which shifts the position of each node by a small random amount and lessens overlap. There are many papers on jitter and some better versions than the uniform jitter I chose here, but uniform is the simplest to implement.
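To make the arithmetic behind k concrete, here is a minimal sketch. The helper name spring_k and the node count of 4000 are illustrative assumptions only; neither is part of networkx.

```python
from math import sqrt


def spring_k(n_nodes, factor=1.0):
    """Return factor / sqrt(n_nodes).

    factor=1.0 reproduces the default quoted above, 1/sqrt(len(G));
    factors of roughly 1.2 to 1.7 correspond to the "slight increase".
    spring_k is a hypothetical helper name, not a networkx function.
    """
    if n_nodes <= 0:
        raise ValueError("n_nodes must be positive")
    return factor / sqrt(n_nodes)


# Illustrative node count only; in practice you would use len(G).
n_nodes = 4000
for factor in (1.0, 1.2, 1.5, 1.7):
    print(f"factor={factor:.1f} -> k={spring_k(n_nodes, factor):.4f}")
```

The value would then be passed on as, for example, nx.spring_layout(G, k=spring_k(len(G), 1.5), seed=7).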
This code creates a similar-looking dataset:
import random
import numpy as np
import pandas as pd
from copy import deepcopy
import networkx as nx
import matplotlib.pyplot as plt
from math import sqrt
random.seed(7)
np.random.seed(7)
# Create a bigger dataset
smallSample = [
[0, 1492, 1],
[12, 937, 1],
[16, 989, 1],
[18, 371, 1],
[18, 1140, 1],
[26, 398, 1],
[26, 1061, 1],
[30, 1823, 1],
[33, 1637, 1],
[54, 1047, 1],
[63, 565, 1]]
sample = deepcopy(smallSample)
AMOUNT = 4000
# Only source and target are node ids; the third entry of each edge is the weight
present_nodes = list(set(x for edge in sample for x in edge[:2]))
i = 2
while i < AMOUNT:
    source = target = None
    # Retry until the edge connects two different nodes
    while source == target:
        if random.random() < 0.9:
            # Create at least one new node
            source = i
            if random.random() < 0.7:  # High value for many small clusters
                # Create a second new node
                target = i = i + 1
                present_nodes.append(target)
            else:
                target = random.choice(present_nodes)
            present_nodes.append(source)
        else:  # Link existing ones
            source = random.choice(present_nodes)
            target = random.choice(present_nodes)
    i += 1
    sample.append([source, target, 1])
CC = pd.DataFrame(sample, columns=["source", "target", "weight"], dtype=int)
# Create the Graph
G = nx.Graph()
for index, row in CC.iterrows():
    G.add_edge(CC['source'].loc[index], CC['target'].loc[index], weight=1)
Calculate Positions
# Default k = 1/sqrt(len(G))
pos = nx.spring_layout(G, k=1/sqrt(len(G)), seed=7, iterations=100)
# cast the pos dict to an np.array
apos = np.fromiter(pos.values(), dtype=np.dtype((float, 2)))
Default Look
plt.figure(figsize=(10, 10))  # open the figure before drawing; the size is an arbitrary choice
nx.draw_networkx_nodes(G, pos, node_size=10, alpha=0.45, linewidths=0.2)
nx.draw_networkx_edges(G, pos, edgelist=G.edges(), width=0.5, alpha=0.2)
plt.title("Transparency")
plt.show()
A larger k increases the distances between the nodes and makes the layout less clumpy:
pos15 = nx.spring_layout(G, k=1.5/sqrt(len(G)), seed=7, iterations=100) # Larger k to make it less clumpy
# cast the pos dict to an np.array
apos15 = np.fromiter(pos15.values(), dtype=np.dtype((float, 2)))
plt.figure(figsize=(10, 10))  # open the figure before drawing; the size is an arbitrary choice
nx.draw_networkx_nodes(G, pos15, node_size=10, alpha=0.55, linewidths=0.2)
nx.draw_networkx_edges(G, pos15, edgelist=G.edges(), width=0.5, alpha=0.2)
plt.title("Larger k")
plt.show()
JITTER = 0.025
jitter = np.random.uniform(low=-JITTER, high=JITTER, size=apos.shape)
jpos = {k:p for k,p in zip(pos.keys(), apos + jitter)}
jpos15 = {k:p for k,p in zip(pos15.keys(), apos15 + jitter)}
plt.figure(figsize=(10, 10))  # open the figure before drawing; the size is an arbitrary choice
nx.draw_networkx_nodes(G, jpos, node_size=10, alpha=0.45, linewidths=0.2)
nx.draw_networkx_edges(G, jpos, edgelist=G.edges(), width=0.5, alpha=0.2)
plt.title("default + jitter")
plt.show()
plt.figure(figsize=(10, 10))
# As the nodes overlap less, I would increase the alpha level a bit
nx.draw_networkx_nodes(G, jpos15, node_size=10, alpha=0.55, linewidths=0.2)
nx.draw_networkx_edges(G, jpos15, edgelist=G.edges(), width=0.5, alpha=0.2)
plt.title("larger k + jitter")
plt.show()
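The uniform jitter step above can also be wrapped in a small reusable function. This is only a sketch: jitter_positions is a made-up helper name, and it uses the standard library's random module instead of NumPy so that it runs without any dependencies. It assumes pos is a dict mapping each node to an (x, y) pair, which is the shape nx.spring_layout returns.

```python
import random


def jitter_positions(pos, amount=0.025, seed=None):
    """Return a new dict with uniform jitter in [-amount, amount] added to x and y.

    The input dict is left unchanged. jitter_positions is a hypothetical
    helper, not part of networkx.
    """
    rng = random.Random(seed)
    jittered = {}
    for node, (x, y) in pos.items():
        jittered[node] = (
            float(x) + rng.uniform(-amount, amount),
            float(y) + rng.uniform(-amount, amount),
        )
    return jittered


# Tiny hand-made layout standing in for the output of nx.spring_layout;
# nodes 0 and 1492 deliberately sit on top of each other.
example_pos = {0: (0.0, 0.0), 1492: (0.0, 0.0), 12: (0.5, -0.5)}
example_jpos = jitter_positions(example_pos, amount=0.025, seed=7)
for node, (x, y) in example_jpos.items():
    print(node, round(x, 3), round(y, 3))
```

With a real layout this could be used as jpos = jitter_positions(pos, amount=JITTER, seed=7) and the result passed to the drawing functions in place of pos.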
In the end, it takes some playing around with the parameters to find something you like.