Tags: python, parallel-processing, scikit-learn, k-means

Sklearn kmeans with multiprocessing


I can't understand how n_jobs works:

import sklearn.cluster
import sklearn.datasets
data, labels = sklearn.datasets.make_blobs(n_samples=1000, n_features=416, centers=20)
k_means = sklearn.cluster.KMeans(n_clusters=10, max_iter=3, n_jobs=1).fit(data)

This runs in less than 1 second.

With n_jobs=2, it takes nearly twice as long.

With n_jobs=8, it takes so long that it never finished on my computer (I have 8 cores).

Is there something I'm misunderstanding about how parallelization works?


Solution

  • n_jobs specifies the number of concurrent processes or threads to use for parallelized routines.

    From the docs:

    Some parallelism uses a multi-threading backend by default, some a multi-processing backend. It is possible to override the default backend by using sklearn.utils.parallel_backend.

    Because of Python's GIL, more threads do not guarantee better speed. So check whether your backend is configured for threads or processes. If it uses threads, try switching it to processes (though you will then pay the overhead of inter-process communication).

    Again from the docs:

    Whether parallel processing is helpful at improving runtime depends on many factors, and it’s usually a good idea to experiment rather than assuming that increasing the number of jobs is always a good thing. It can be highly detrimental to performance to run multiple copies of some estimators or functions in parallel.

    So n_jobs is not a silver bullet: you have to experiment to see whether it helps for your estimators and your kind of data.
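
To make the "experiment rather than assume" advice concrete, here is a minimal timing sketch that uses only the standard library. The names `cpu_bound`, `run_with_threads` and `benchmark` are hypothetical helpers, not part of scikit-learn or joblib, and the thread pool is only a stand-in for a threading backend. On a standard CPython build with the GIL, the pure-Python loop below is not expected to get faster with more threads, but the actual numbers depend on the machine, so none are shown.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def cpu_bound(n):
    """Pure-Python loop; it holds the GIL for as long as it runs."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_with_threads(n_jobs, tasks=8, size=50_000):
    """Hypothetical workload: run `tasks` copies of cpu_bound on n_jobs threads."""
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(cpu_bound, [size] * tasks))


def benchmark(run, n_jobs_values, repeats=3):
    """Return {n_jobs: best wall-clock seconds} for calling run(n_jobs)."""
    timings = {}
    for n_jobs in n_jobs_values:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run(n_jobs)
            best = min(best, time.perf_counter() - start)
        timings[n_jobs] = best
    return timings


if __name__ == "__main__":
    # Same worker counts as in the question: 1, 2 and 8.
    for n_jobs, seconds in benchmark(run_with_threads, [1, 2, 8]).items():
        print(f"n_jobs={n_jobs}: {seconds:.4f}s")
```

`benchmark` accepts any callable that takes a worker count, so the same loop could time the question's `KMeans(..., n_jobs=n).fit(data)` call on a scikit-learn version whose `KMeans` still accepts `n_jobs` (the parameter was deprecated and later removed from `KMeans`). Swapping `ThreadPoolExecutor` for `ProcessPoolExecutor` would give the process-based analogue, at the cost of inter-process communication.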