pandas scipy networkx kdtree connected-components

Connected components on Pandas DataFrame with Networkx

Action To cluster points based on distance and label using connected components.

Problem The back and forth switching between NetworkX nodes storage of attributes and Pandas DataFrame

Seems too complex
Index/key errors when looking up nodes

Tried Using different functions like Scikit NearestNeighbours, however resulting in the same back and forth moving of data.

Question Is there a simpler way to perform this connected components operation?

Example

import numpy as np
import pandas as pd
import dask.dataframe as dd
import networkx as nx
from scipy import spatial

#generate example dataframe
pdf = pd.DataFrame({'x':[1.0,2.0,3.0,4.0,5.0],
                    'y':[1.0,2.0,3.0,4.0,5.0], 
                    'z':[1.0,2.0,3.0,4.0,5.0], 
                    'label':[1,2,1,2,1]}, 
                   index=[1, 2, 3, 4, 5])
df = dd.from_pandas(pdf, npartitions = 2)

object_id = 0
def cluster(df, object_id=object_id):
    # create kdtree
    tree = spatial.cKDTree(df[['x', 'y', 'z']])

    # get neighbours within distance for every point, store in dataframe as edges
    edges = pd.DataFrame({'src':[], 'tgt':[]}, dtype=int)
    for source, target in enumerate(tree.query_ball_tree(tree, r=2)):
        target.remove(source)
        if target:
            edges = edges.append(pd.DataFrame({'src':[source] * len(target), 'tgt':target}), ignore_index=True)

    # create graph for points using edges from Balltree query
    G = nx.from_pandas_dataframe(edges, 'src', 'tgt')

    for i in sorted(G.nodes()):
        G.node[i]['label'] = nodes.label[i]
        G.node[i]['x'] = nodes.x[i]
        G.node[i]['y'] = nodes.y[i]
        G.node[i]['z'] = nodes.z[i]

    # remove edges between points of different classes
    G.remove_edges_from([(u,v) for (u,v) in G.edges_iter() if G.node[u]['label'] != G.node[v]['label']])

    # find connected components, create dataframe and assign object id
    components = list(nx.connected_component_subgraphs(G))
    df_objects = pd.DataFrame()

    for c in components:
        df_object = pd.DataFrame([[i[0], i[1]['x'], i[1]['y'], i[1]['z'], i[1]['label']] for i in c.nodes(data=True)]
                                 , columns=['point_id', 'x', 'y', 'z', 'label']).set_index('point_id')
        df_object['object_id'] = object_id
        df_objects.append(df_object)
        object_id += 1

    return df_objects

meta = pd.DataFrame(np.empty(0, dtype=[('x',float),('y',float),('z',float), ('label',int), ('object_id', int)]))
df.apply(cluster, axis=1, meta=meta).head(10)

Solution

You can use DBSCAN from scikit-learn. For min_samples=1 it basically finds connected components. It can use different algorithms for nearest neighbors computation and is configured through the parameter algorithm (kd-tree is one of the options).

My other suggestion is to do the computation separately for different labels. This simplifies the implementation and allows for parallelization.

These two suggestions can be implemented as follows:

from sklearn.cluster import DBSCAN

def add_cluster(df, distance):
    db = DBSCAN(eps=distance, min_samples=1).fit(df[["x", "y", ...]])
    return df.assign(cluster=db.labels_)

df = df.groupby("label", group_keys=False).apply(add_cluster, distance)

It should work for both Pandas and Dask dataframes. Note that the cluster-id starts from 0 for each label, i.e. a cluster is uniquely identified by the tuple (label, cluster).

Here is a complete example with artificial data:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

plt.rc("figure", dpi=100)
plt.style.use("ggplot")

# create fake data
centers = [[1, 1], [-1, -1], [1, -1], [-1, 1]]
XY, labels = make_blobs(n_samples=100, centers=centers, cluster_std=0.2, random_state=0)
inp = (
    pd.DataFrame(XY, columns=["x", "y"])
    .assign(label=labels)
    .replace({"label": {2: 0, 3: 1}})
)

def add_cluster(df, distance):
    db = DBSCAN(eps=distance, min_samples=1).fit(df[["x", "y"]])
    return df.assign(cluster=db.labels_)

out = inp.groupby("label", group_keys=False).apply(add_cluster, 0.5)

# visualize
label_marker = ["o", "s"]
ax = plt.gca()
ax.set_aspect('equal')

for (label, cluster), group in out.groupby(["label", "cluster"]):
    plt.scatter(group.x, group.y, marker=label_marker[label])

The resulting dataframe looks like this:

The plot of the clusters looks as follows. Labels are indicated by the marker shape and clusters by the color.