I'm trying to use agglomerative clustering with a custom distance metric (i.e. affinity), since I'd like to cluster sequences of integers by sequence similarity rather than by something like the Euclidean distance, which isn't meaningful here.
My data looks something like this:
>>> dat.values
array([[860, 261, 240, ..., 300, 241,   1],
       [860, 840, 860, ..., 860, 240,   1],
       [260, 860, 260, ..., 260, 220,   1],
       ...,
       [260, 260, 260, ..., 260, 260,   1],
       [260, 860, 260, ..., 840, 860,   1],
       [280, 240, 241, ..., 240, 260,   1]])
I've created the following similarity function:
def sim(x, y):
    return np.sum(np.equal(np.array(x), np.array(y)))/len(x)
So I just return the fraction of matching values in the two sequences with NumPy, and make the following call:
cluster = AgglomerativeClustering(n_clusters=5, affinity=sim, linkage='average')
cluster.fit(dat.values)
But I'm getting an error saying
TypeError: sim() missing 1 required positional argument: 'y'
I'm not sure why I'm getting this error; I thought the function would be called on pairs of rows, so both required arguments would be passed.
Any help with this would be greatly appreciated.
'affinity' as a callable takes a single input X (your feature or observation matrix) and must return the distances between all pairs of points (samples) in it. That is why sim(x, y) fails: it is called with X only, so y is missing.
So you need to modify your code as follows:
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import pairwise_distances

# Your method to calculate the distance between two samples
def sim(x, y):
    return np.sum(np.equal(np.array(x), np.array(y)))/len(x)

# Method to calculate the distances between all sample pairs
def sim_affinity(X):
    return pairwise_distances(X, metric=sim)

cluster = AgglomerativeClustering(n_clusters=5, affinity=sim_affinity, linkage='average')
cluster.fit(X)
Or you can use affinity='precomputed', as @avchauzov has suggested. For that you will have to pass the precomputed distance matrix for your observations to fit(). Something like:
cluster = AgglomerativeClustering(n_clusters=5, affinity='precomputed', linkage='average')
distance_matrix = sim_affinity(X)
cluster.fit(distance_matrix)
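As a self-contained sketch of the precomputed route, the snippet below runs it end to end on made-up toy data. The name mismatch_distance and the toy rows are illustrative assumptions, not part of the question, and the callable returns a distance (1 minus the matching fraction) so that similar rows end up close together. To my knowledge, newer scikit-learn releases renamed the affinity keyword to metric, so the sketch picks whichever keyword the installed version accepts.

```python
import inspect

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import pairwise_distances


def mismatch_distance(x, y):
    # Fraction of positions where the two rows differ (0.0 = identical rows).
    return 1.0 - np.mean(np.equal(x, y))


# Made-up toy data: rows 0-1 are similar, rows 2-3 are similar.
X = np.array([
    [860, 261, 240, 300],
    [860, 261, 240, 241],
    [260, 260, 260, 260],
    [260, 260, 260, 280],
])

# Square (n_samples, n_samples) matrix of pairwise distances.
distance_matrix = pairwise_distances(X, metric=mismatch_distance)

# Pick the keyword the installed scikit-learn accepts ('metric' or 'affinity').
params = inspect.signature(AgglomerativeClustering.__init__).parameters
keyword = "metric" if "metric" in params else "affinity"

cluster = AgglomerativeClustering(
    n_clusters=2, linkage="average", **{keyword: "precomputed"}
)
labels = cluster.fit_predict(distance_matrix)
print(labels)
```

With this toy data, rows 0 and 1 should share one label and rows 2 and 3 the other; which cluster gets which label number is arbitrary.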
Note: your function returns a similarity, not a distance, so identical rows score 1.0 and would be treated as far apart. Make sure you understand how the clustering will work here, or tweak your similarity function to return a distance instead. Something like:
def sim(x, y):
    # Subtracted from 1.0 (highest similarity), so now it represents distance
    return 1.0 - np.sum(np.equal(np.array(x), np.array(y)))/len(x)
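The distance above is the fraction of mismatching positions, i.e. the normalized Hamming distance. Because calling a Python function for every pair of rows can be slow, here is a minimal vectorized sketch with plain NumPy. The function name mismatch_distance_matrix and the toy rows are assumptions for illustration, and broadcasting builds an (n, n, d) intermediate array, so this only suits data that fits in memory. SciPy's pdist with metric='hamming' computes the same quantity.

```python
import numpy as np


def mismatch_distance_matrix(X):
    # Compare every row with every other row via broadcasting, then average
    # the per-position mismatches to get one distance per pair of rows.
    X = np.asarray(X)
    return (X[:, None, :] != X[None, :, :]).mean(axis=2)


# Made-up toy rows for illustration.
X = np.array([
    [860, 261, 240, 300],
    [860, 261, 240, 241],
    [260, 260, 260, 260],
])

D = mismatch_distance_matrix(X)
print(D)
```

The result is symmetric with a zero diagonal, so it can be passed directly to fit() when the precomputed option is used.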