I would like to run a clustering algorithm in scikit-learn and use it in a standard pipeline (i.e., I need to write it in . For this clustering algorithm, I would like to run kmeans N times (i.e., with N different initial points), and then use my own function to choose the best run. The currently implemented version of kmeans has a built-in way to run N initializations and to choose the best one based on minimized within-cluster variance. Essentially, I want to copy this kmeans function but use a different criterion for the "best" fit.
I'm trying to figure out the best way to do this. A promising approach seems to be to write my own estimator (e.g., using the tools at https://github.com/scikit-learn-contrib/project-template/). It seems that this estimator would need to implement fit, fit_predict, fit_transform, get_params, predict, score, set_params, and transform. In my mind, this estimator could just run kmeans N times internally, then return the single best centroid fit per my criteria.
Is there a simpler way to do this?
Have you considered using inheritance?
Python supports object-oriented programming, so you can subclass the sklearn KMeans class, override only the outer loop over initializations (in practice, the fit method), and inherit everything else.
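A rough sketch of the inheritance idea, under some loud assumptions: the class name CustomCriterionKMeans and the selection_score method are hypothetical, the silhouette score is only a stand-in for your own criterion, and copying the fitted state from the winning run relies on how KMeans stores its attributes internally, which may differ between scikit-learn versions. Rather than reimplementing the library's internal loop, this version runs the parent's fit once per initialization on a clone and keeps the best one.

```python
import numpy as np
from sklearn.base import clone
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.utils import check_random_state


class CustomCriterionKMeans(KMeans):
    """Hypothetical subclass: KMeans parameters, custom rule for the best run."""

    def selection_score(self, X, labels):
        # Placeholder criterion (an assumption): silhouette score, higher is better.
        return silhouette_score(X, labels)

    def fit(self, X, y=None, sample_weight=None):
        # Assumption of this sketch: a non-integer n_init (e.g. "auto") means 10 runs.
        n_runs = self.n_init if isinstance(self.n_init, int) else 10
        rng = check_random_state(self.random_state)
        param_names = set(self.get_params())
        best_score, best_run = -np.inf, None
        for _ in range(n_runs):
            seed = int(rng.randint(np.iinfo(np.int32).max))
            # Unfitted copy of this estimator, restricted to one initialization.
            run = clone(self).set_params(n_init=1, random_state=seed)
            # Call the parent's fit directly so this override is not re-entered.
            KMeans.fit(run, X, sample_weight=sample_weight)
            score = self.selection_score(X, run.labels_)
            if score > best_score:
                best_score, best_run = score, run
        # Copy the fitted state (public and private) of the winning run,
        # leaving this estimator's own constructor parameters untouched.
        fitted = {k: v for k, v in vars(best_run).items() if k not in param_names}
        self.__dict__.update(fitted)
        self.selection_score_ = best_score
        return self
```

With this approach predict, transform, score, fit_predict, fit_transform, get_params, and set_params all come from KMeans unchanged, so only the selection rule is new code.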