python scikit-learn hierarchical-clustering sklearn-pandas

Sklearn Agglomerative Clustering Custom Affinity


I'm trying to use agglomerative clustering with a custom distance metric (i.e. affinity), because I'd like to cluster sequences of integers by sequence similarity rather than by something like Euclidean distance, which isn't meaningful here.

My data looks something like this

>> dat.values 

array([[860, 261, 240, ..., 300, 241,   1],
   [860, 840, 860, ..., 860, 240,   1],
   [260, 860, 260, ..., 260, 220,   1],
   ...,
   [260, 260, 260, ..., 260, 260,   1],
   [260, 860, 260, ..., 840, 860,   1],
   [280, 240, 241, ..., 240, 260,   1]]) 

I've created the following similarity function

def sim(x, y): 
    return np.sum(np.equal(np.array(x), np.array(y)))/len(x)

So I just return the fraction of matching values in the two sequences using numpy, and make the following call

cluster = AgglomerativeClustering(n_clusters=5, affinity=sim, linkage='average')
cluster.fit(dat.values)

But I'm getting an error saying

TypeError: sim() missing 1 required positional argument: 'y'

I'm not sure why I'm getting this error; I thought the function would be called on pairs of rows, so both required arguments would be passed.

Any help with this would be greatly appreciated


Solution

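  • When 'affinity' is a callable, it is called with a single argument X (your feature or observation matrix) and must return the distances between all pairs of points (samples) in it. Your sim(x, y) expects two rows, which is why the second argument is reported as missing.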

    So you need to modify your method as:

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.metrics import pairwise_distances
    
    # Your method to calculate distance between two samples
    def sim(x, y): 
        return np.sum(np.equal(np.array(x), np.array(y)))/len(x)
    
    
    # Method to calculate distances between all sample pairs
    def sim_affinity(X):
        return pairwise_distances(X, metric=sim)
    
    # X is the observation matrix, e.g. dat.values from the question
    cluster = AgglomerativeClustering(n_clusters=5, affinity=sim_affinity, linkage='average')
    cluster.fit(X)
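
To see the difference between the two call signatures, here is a minimal sketch on made-up data. The names seq_distance and seq_affinity and the toy matrix are hypothetical, and the pairwise function already returns a distance (fraction of mismatches) rather than a similarity. Calling the two-argument function with only the matrix mimics what the estimator does with a callable affinity and reproduces the TypeError from the question.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def seq_distance(x, y):
    # Fraction of positions where the two sequences differ (0.0 = identical)
    return 1.0 - np.mean(np.asarray(x) == np.asarray(y))

def seq_affinity(X):
    # One argument in, one (n_samples, n_samples) distance matrix out
    return pairwise_distances(X, metric=seq_distance)

# Hypothetical toy data: three sequences of four integers
X = np.array([[860, 261, 240, 1],
              [860, 840, 240, 1],
              [260, 260, 260, 1]])

# Passing only the matrix to the pairwise function reproduces the error
try:
    seq_distance(X)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# The wrapper accepts the whole matrix and returns all pairwise distances
D = seq_affinity(X)
print(D.shape)  # (3, 3)
```

Note that pairwise_distances calls the Python function once per pair of rows, so this can be slow for large datasets.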
    

    Or you can use affinity='precomputed', as @avchauzov has suggested. In that case you pass the precomputed distance matrix for your observations to fit() instead of the raw data. Something like:

    cluster = AgglomerativeClustering(n_clusters=5, affinity='precomputed', linkage='average')
    distance_matrix = sim_affinity(X)
    cluster.fit(distance_matrix)
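
A small sketch of the precomputed route, using a hypothetical hand-written 3x3 distance matrix. One caveat, stated as an assumption about your installed version: as far as I know, newer scikit-learn releases renamed the affinity parameter to metric (deprecated around 1.2 and removed later), so the sketch picks whichever keyword the installed AgglomerativeClustering accepts.

```python
import inspect

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical distance matrix: samples 0 and 1 are close, sample 2 is far
D = np.array([[0.00, 0.25, 0.75],
              [0.25, 0.00, 0.75],
              [0.75, 0.75, 0.00]])

# Use "metric" if the installed version has it, otherwise fall back to "affinity"
params = inspect.signature(AgglomerativeClustering).parameters
keyword = "metric" if "metric" in params else "affinity"

cluster = AgglomerativeClustering(
    n_clusters=2, linkage="average", **{keyword: "precomputed"}
)
labels = cluster.fit_predict(D)
print(labels)
```

With this matrix, samples 0 and 1 should end up in one cluster and sample 2 in the other; the numeric label values themselves are arbitrary.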
    

    Note: Your function returns a similarity, not a distance. Agglomerative clustering merges the clusters with the smallest values first, so with a similarity it would group the least similar sequences together. Tweak the function to return a distance instead, for example:

    def sim(x, y): 
        # Subtracted from 1.0 (highest similarity), so now it represents distance
        return 1.0 - np.sum(np.equal(np.array(x), np.array(y)))/len(x)
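
As a side note, one minus the fraction of matching positions is the normalized Hamming distance, which pairwise_distances supports by name. The sketch below, on hypothetical toy data, checks that the built-in 'hamming' metric agrees with the Python callable; the built-in version avoids calling a Python function for every pair of rows.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def sim(x, y):
    # Distance version: 1.0 minus the fraction of matching positions
    return 1.0 - np.sum(np.equal(np.array(x), np.array(y))) / len(x)

# Hypothetical toy data: three sequences of four integers
X = np.array([[860, 261, 240, 1],
              [860, 840, 240, 1],
              [260, 260, 260, 1]])

D_callable = pairwise_distances(X, metric=sim)
D_hamming = pairwise_distances(X, metric="hamming")

# Both approaches should give the same distance matrix
print(np.allclose(D_callable, D_hamming))
```

If the two matrices agree on your data as well, the 'hamming' result can be passed to fit() with the precomputed option.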