I want to grid search over a set of hyperparameters to tune a clustering model. GridSearchCV offers several scoring functions for unsupervised learning, but I want to use one that isn't in there, e.g. silhouette score.
The documentation on implementing a custom scoring function doesn't explain how the function should be defined. The example there simply imports a custom scorer and wraps it with make_scorer. However, make_scorer seems to require the true labels (which don't exist in unsupervised learning), so it's not clear how to use it.
Here's what I have so far:
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score, make_scorer
def my_custom_function(model, X):
    preds = model.predict(X)
    return silhouette_score(X, preds)
Z, _ = make_blobs()
model = DBSCAN()
pgrid = {'eps': [0.1*i for i in range(1,6)]}
gs = GridSearchCV(model, pgrid, scoring=my_custom_function)
gs.fit(Z)
best_score = gs.score(Z)
But it throws two errors:
TypeError: my_custom_function() takes 2 positional arguments but 3 were given
and
AttributeError: 'DBSCAN' object has no attribute 'predict'
How do I correctly define my custom scoring function?
DBSCAN has no predict method, but you can write a custom one. One approach is to iterate over the core points and assign the new point to the cluster of the first core point that lies within eps of it. This guarantees that, under the clustering definitions in use, the point would at least qualify as a border point of the cluster it's assigned to.
As for the TypeError: GridSearchCV calls the scorer as scorer(estimator, X, y), so your scoring function must accept a third argument, which can simply default to None.
import numpy as np
from scipy.spatial.distance import euclidean
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

def dbscan_predict(dbscan_model, X_new, metric=euclidean):
    # The metric should match the one the model was fitted with
    # (DBSCAN's default is euclidean)
    # Result is noise by default
    y_new = np.ones(shape=len(X_new), dtype=int) * -1
    # Iterate over all input samples to find a label
    for j, x_new in enumerate(X_new):
        # Find a core sample closer than eps
        for i, x_core in enumerate(dbscan_model.components_):
            if metric(x_new, x_core) < dbscan_model.eps:
                # Assign the label of x_core to x_new
                y_new[j] = dbscan_model.labels_[dbscan_model.core_sample_indices_[i]]
                break
    return y_new
def my_custom_function(model, X, y=None):
    # for models that implement it, e.g. KMeans, you could use `predict` instead
    preds = dbscan_predict(model, X)
    # silhouette_score is undefined when there is only one label
    return silhouette_score(X, preds) if len(set(preds)) > 1 else float('nan')
model = DBSCAN()
pgrid = {
    'eps': [0.1 * i for i in range(1, 8)],
    'min_samples': range(2, 5),
}
Z, _ = make_blobs(400, random_state=0)
gs = GridSearchCV(model, pgrid, scoring=my_custom_function)
gs.fit(Z)
best_estimator = gs.best_estimator_
best_score = gs.score(Z)