Tags: python, scikit-learn, cluster-analysis, grid-search

How to use a custom scoring function in GridSearchCV for unsupervised learning


I want to grid search over a set of hyperparameters to tune a clustering model. GridSearchCV offers a number of built-in scoring functions, but I want to use one that isn't among them, e.g. the silhouette score.

The documentation on custom scoring functions is unclear about how the function should be defined. The example there simply imports a custom scorer and uses make_scorer to create a scoring function. However, make_scorer seems to require ground-truth labels (which don't exist in unsupervised learning), so it isn't clear how to apply it here.
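
For reference, here is the supervised pattern as I understand it (a minimal sketch; accuracy_score is just a stand-in metric):

from sklearn.metrics import make_scorer, accuracy_score

# make_scorer wraps a metric with signature (y_true, y_pred), so the
# resulting scorer is always invoked as scorer(estimator, X, y_true)
acc_scorer = make_scorer(accuracy_score)
# acc_scorer(fitted_model, X_test, y_test)
#   -> accuracy_score(y_test, fitted_model.predict(X_test))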

Here's what I have so far:

from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score, make_scorer

def my_custom_function(model, X):
    preds = model.predict(X)
    return silhouette_score(X, preds)

Z, _ = make_blobs()

model = DBSCAN()
pgrid = {'eps': [0.1*i for i in range(1,6)]}
gs = GridSearchCV(model, pgrid, scoring=my_custom_function)
gs.fit(Z)
best_score = gs.score(Z)

But it fails with two errors (the second appears after working around the first):

TypeError: my_custom_function() takes 2 positional arguments but 3 were given

and

AttributeError: 'DBSCAN' object has no attribute 'predict'

How do I correctly define my custom scoring function?


Solution

  • There is no predict method on DBSCAN, but you can write a custom one.

    Two things are going wrong in the question's code. First, GridSearchCV calls a callable scorer as scorer(estimator, X, y), i.e. with three arguments, which explains the TypeError; the fix is to accept y=None. Second, DBSCAN is transductive: it only labels the data it was fit on, so it has no predict to call on the held-out fold during cross-validation, which explains the AttributeError.

    One workaround is to iterate over the fitted model's core points and assign each new point to the cluster of the first core point that lies within eps of it. By the usual DBSCAN definitions, such a point would at least qualify as a border point of the cluster it is assigned to.

    import numpy as np
    from scipy.spatial.distance import euclidean
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import GridSearchCV
    from sklearn.cluster import DBSCAN
    from sklearn.metrics import silhouette_score
    
    def dbscan_predict(dbscan_model, X_new, metric=euclidean):
        # NOTE: the metric should match the one the model was fit with
        # (DBSCAN's default is euclidean)
        # Every sample is noise (-1) until a core point claims it
        y_new = np.full(len(X_new), -1, dtype=int)
    
        # Iterate over all input samples to find a label
        for j, x_new in enumerate(X_new):
            # Find a core sample closer than eps
            for i, x_core in enumerate(dbscan_model.components_):
                if metric(x_new, x_core) < dbscan_model.eps:
                    # Assign the label of x_core to x_new
                    y_new[j] = dbscan_model.labels_[dbscan_model.core_sample_indices_[i]]
                    break
    
        return y_new
    
    def my_custom_function(model, X, y=None):
        # GridSearchCV calls the scorer as scorer(estimator, X, y);
        # y is unused here but must be accepted.
        # For models that implement it, e.g. KMeans, `model.predict(X)`
        # could be used instead of the custom dbscan_predict.
        preds = dbscan_predict(model, X)
        # silhouette_score needs at least 2 distinct labels; NaN scores
        # are ranked last by GridSearchCV
        return silhouette_score(X, preds) if len(set(preds)) > 1 else float('nan')
    
    model = DBSCAN()
    pgrid = {
        'eps': [0.1*i for i in range(1, 8)],
        'min_samples': range(2, 5)
    }
    
    Z, _ = make_blobs(400, random_state=0)
    gs = GridSearchCV(model, pgrid, scoring=my_custom_function)
    gs.fit(Z)
    best_estimator = gs.best_estimator_
    best_score = gs.score(Z)
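
    The y=None trick generalizes to any clusterer. As a quick sketch (illustrative names, reusing Z and the imports above): an inductive model such as KMeans exposes predict, so no custom assignment step is needed:

    from sklearn.cluster import KMeans
    
    def kmeans_silhouette(model, X, y=None):
        # KMeans is inductive: the fitted model can label unseen data directly
        labels = model.predict(X)
        return silhouette_score(X, labels) if len(set(labels)) > 1 else float('nan')
    
    gs_km = GridSearchCV(KMeans(n_init=10), {'n_clusters': range(2, 6)},
                         scoring=kmeans_silhouette)
    gs_km.fit(Z)
    print(gs_km.best_params_)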
    
    

    ref: https://stackoverflow.com/a/35458920/5025009