Custom criteria for kmeans in scikit-learn


I would like to run a clustering algorithm in scikit-learn and use it in a standard pipeline (i.e., I need to write it in . For this clustering algorithm, I would like to run kmeans N times (i.e., with N different initial points), and then use my own function to choose the best run. The current implementation of KMeans already has a built-in way to run N times from different initializations (the n_init parameter) and keep the run with the lowest within-cluster sum of squares (inertia). Essentially I want to copy this behavior, but use a different criterion for the "best" fit.
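
For reference, the built-in behavior described above looks roughly like the sketch below. The toy data and the variable names are made up for illustration; only `n_init` and `inertia_` come from scikit-learn itself.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two well-separated blobs (illustrative only).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

# n_init=10 runs k-means from 10 different initializations and keeps
# the run with the lowest inertia (within-cluster sum of squares).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# inertia_ is the value the built-in selection minimizes.
print(km.inertia_)
```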

I'm trying to figure out the best way to do this. A promising approach seems to be to write my own estimator (e.g., using the tools at https://github.com/scikit-learn-contrib/project-template/). It seems that this estimator would need to implement fit, fit_predict, fit_transform, get_params, predict, score, set_params, and transform. In my mind, this estimator could just run kmeans N times internally, then return the single best set of centroids according to my criterion.

Is there a simpler way to do this?


Solution

  • Have you considered using inheritance?

    Python supports subclassing, so you could override only the outer loop of scikit-learn's KMeans class (the part that repeats the run from different initializations and picks a winner) and inherit everything else.
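
A minimal sketch of that idea, assuming a recent scikit-learn release. Instead of touching KMeans internals, it overrides `fit`, runs the parent `fit` several times with `n_init=1`, and keeps the run with the best score. The class name `BestOfNKMeans`, the `n_runs` parameter, the `criterion` method, and the `best_criterion_` attribute are all hypothetical names introduced here, and the silhouette score is only a placeholder for your own criterion.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import check_random_state


class BestOfNKMeans(KMeans):
    """Hypothetical subclass: best of n_runs single-init runs by a custom score."""

    def __init__(self, n_clusters=8, n_runs=10, random_state=None):
        # n_init=1 so each parent fit is exactly one k-means run.
        super().__init__(n_clusters=n_clusters, n_init=1, random_state=random_state)
        self.n_runs = n_runs

    def criterion(self, X, labels):
        # Placeholder score (higher is better); override with your own.
        return silhouette_score(X, labels)

    def fit(self, X, y=None, sample_weight=None):
        original = self.random_state
        rng = check_random_state(original)
        best_score, best_state = -np.inf, None
        try:
            for _ in range(self.n_runs):
                # Give every run its own seed derived from random_state.
                self.random_state = int(rng.randint(np.iinfo(np.int32).max))
                super().fit(X, y, sample_weight=sample_weight)
                score = self.criterion(X, self.labels_)
                if best_state is None or score > best_score:
                    # Snapshot all attributes (public and private) of this run.
                    best_score, best_state = score, dict(vars(self))
            # Restore the attributes of the winning run.
            self.__dict__.update(best_state)
        finally:
            # Leave the user-supplied random_state untouched.
            self.random_state = original
        self.best_criterion_ = best_score
        return self


# Toy data: two well-separated blobs (illustrative only).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

model = BestOfNKMeans(n_clusters=2, n_runs=5, random_state=0)
pipe = Pipeline([("scale", StandardScaler()), ("cluster", model)])
labels = pipe.fit(X).predict(X)
```

Because only `fit` is overridden, `predict`, `transform`, `fit_predict`, `get_params`, and `set_params` are inherited from KMeans, which is what lets the class sit at the end of a Pipeline. Temporarily changing `random_state` inside `fit` is a shortcut for the sketch; a wrapper estimator that holds a separate KMeans per run would avoid it at the cost of writing the other methods by hand.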