Tags: python, scikit-learn, cluster-analysis, grid-search

How to use a custom scoring function in GridSearchCV for unsupervised learning


I want to grid search over a set of hyperparameters to tune a clustering model. GridSearchCV offers a number of built-in scoring functions, but I want to use one that isn't among them, e.g. the silhouette score.

The documentation on custom scoring functions is unclear about how the function should be defined. The example there simply imports a custom scorer and uses make_scorer to create a scoring function. However, make_scorer seems to require ground-truth labels (which don't exist in unsupervised learning), so it isn't clear how to apply it here.
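
For reference, here is the supervised pattern as I understand it (a minimal sketch; accuracy_score is just a stand-in metric):

from sklearn.metrics import make_scorer, accuracy_score

# make_scorer wraps a metric with signature (y_true, y_pred), so the
# resulting scorer is always invoked as scorer(estimator, X, y_true)
acc_scorer = make_scorer(accuracy_score)
# acc_scorer(fitted_model, X_test, y_test)
#   -> accuracy_score(y_test, fitted_model.predict(X_test))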

Here's what I have so far:

from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score, make_scorer

def my_custom_function(model, X):
    preds = model.predict(X)
    return silhouette_score(X, preds)

Z, _ = make_blobs()

model = DBSCAN()
pgrid = {'eps': [0.1*i for i in range(1,6)]}
gs = GridSearchCV(model, pgrid, scoring=my_custom_function)
gs.fit(Z)
best_score = gs.score(Z)

But it fails with two errors (the second appears after working around the first):

TypeError: my_custom_function() takes 2 positional arguments but 3 were given

and

AttributeError: 'DBSCAN' object has no attribute 'predict'

How do I correctly define my custom scoring function?


Solution

  • There is no predict method on DBSCAN, but you can write a custom one.

    Two things are going wrong in the question's code. First, GridSearchCV calls a callable scorer as scorer(estimator, X, y), i.e. with three arguments, which explains the TypeError; the fix is to accept y=None. Second, DBSCAN is transductive: it only labels the data it was fit on, so it has no predict to call on the held-out fold during cross-validation, which explains the AttributeError.

    One workaround is to iterate over the fitted model's core points and assign each new point to the cluster of the first core point that lies within eps of it. By the usual DBSCAN definitions, such a point would at least qualify as a border point of the cluster it is assigned to.

    import numpy as np
    from scipy.spatial.distance import euclidean
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import GridSearchCV
    from sklearn.cluster import DBSCAN
    from sklearn.metrics import silhouette_score
    
    def dbscan_predict(dbscan_model, X_new, metric=euclidean):
        # NOTE: the metric should match the one the model was fit with
        # (DBSCAN's default is euclidean)
        # Every sample is noise (-1) until a core point claims it
        y_new = np.full(len(X_new), -1, dtype=int)
    
        # Iterate over all input samples to find a label
        for j, x_new in enumerate(X_new):
            # Find a core sample closer than eps
            for i, x_core in enumerate(dbscan_model.components_):
                if metric(x_new, x_core) < dbscan_model.eps:
                    # Assign the label of x_core to x_new
                    y_new[j] = dbscan_model.labels_[dbscan_model.core_sample_indices_[i]]
                    break
    
        return y_new
    
    def my_custom_function(model, X, y=None):
        # GridSearchCV calls the scorer as scorer(estimator, X, y);
        # y is unused here but must be accepted.
        # For models that implement it, e.g. KMeans, `model.predict(X)`
        # could be used instead of the custom dbscan_predict.
        preds = dbscan_predict(model, X)
        # silhouette_score needs at least 2 distinct labels; NaN scores
        # are ranked last by GridSearchCV
        return silhouette_score(X, preds) if len(set(preds)) > 1 else float('nan')
    
    model = DBSCAN()
    pgrid = {
        'eps': [0.1*i for i in range(1, 8)],
        'min_samples': range(2, 5)
    }
    
    Z, _ = make_blobs(400, random_state=0)
    gs = GridSearchCV(model, pgrid, scoring=my_custom_function)
    gs.fit(Z)
    best_estimator = gs.best_estimator_
    best_score = gs.score(Z)
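
    The y=None trick generalizes to any clusterer. As a quick sketch (illustrative names, reusing Z and the imports above): an inductive model such as KMeans exposes predict, so no custom assignment step is needed:

    from sklearn.cluster import KMeans
    
    def kmeans_silhouette(model, X, y=None):
        # KMeans is inductive: the fitted model can label unseen data directly
        labels = model.predict(X)
        return silhouette_score(X, labels) if len(set(labels)) > 1 else float('nan')
    
    gs_km = GridSearchCV(KMeans(n_init=10), {'n_clusters': range(2, 6)},
                         scoring=kmeans_silhouette)
    gs_km.fit(Z)
    print(gs_km.best_params_)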
    
    

    ref: https://stackoverflow.com/a/35458920/5025009