Search code examples
pythonnumpyscikit-learnoptics-algorithm

OPTICS parallelism


I have the following script (optics.py) to estimate clustering with precomuted distances:

from sklearn.cluster import OPTICS
import numpy as np

distances = np.load(r'distances.npy')
clust = OPTICS(metric='precomputed', n_jobs=-1)
clust = clust.fit(distances)

Looking at htop results I can see that only one CPU core is used

enter image description here

despite the fact scikit runs clustering in multiple processes:

enter image description here

Why n_jobs=-1 has not resulted in using all the CPU cores?


Solution

  • I also face this problem. According to some papers (for example this, see abstract), OPTICS is known as challenging to do it in parallel because of its sequential nature. So, probably, sklearn tries to use all cores when you use n_jobs=-1, but there is nothing to run on extra cores.

    Probably you should consider other clustering algorithms, which are more parallelism-friendly, for example @paul-brodersen in comments suggests to use HDBSCAN. But it seems that sklearn does not have such parallel alternative for optics, so you need to use other packages.