Tags: scikit-learn, hierarchical-clustering, unsupervised-learning, dbscan, hdbscan

Can we refit clustering algorithms, or fit them in parts?


  • I want to cluster a big dataset (more than 1M records).
  • I want to use the DBSCAN or HDBSCAN algorithm for this clustering task.

When I try to use either of those algorithms, I get a memory error.

  • Is there a way to fit a big dataset in parts (e.g., loop over the data and refit every 1000 records)?
  • If not, is there a better way to cluster a big dataset without upgrading the machine's memory?

Solution

  • If the number of features in your dataset is small (below 20-25), you can consider using BIRCH. It is an incremental method designed for large datasets: instead of holding all the data in memory, it summarizes instances into a compact clustering-feature (CF) tree as they arrive, so you can feed the data in batches and each instance is absorbed into a subcluster of the tree.
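
The batch-wise fitting described above maps directly onto scikit-learn's `Birch` estimator, which supports `partial_fit`. A minimal sketch, using synthetic data as a stand-in for the large dataset (the chunk size of 1000 and the three generated centers are illustrative assumptions, not part of the original question):

```python
# Sketch: incremental clustering with BIRCH, streaming the data in chunks
# via partial_fit instead of loading everything into a single fit() call.
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Hypothetical stand-in for a big dataset: 12,000 points around 3 centers.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(4000, 2)) for c in centers])

birch = Birch(n_clusters=3, threshold=0.5)

# Feed the data in chunks of 1000 records; each call updates the CF tree,
# so the full dataset never has to be processed in one pass.
for start in range(0, len(X), 1000):
    birch.partial_fit(X[start:start + 1000])

labels = birch.predict(X)
print(len(set(labels)))
```

In a real pipeline the chunks would come from disk (e.g. reading the file in pieces) rather than from an in-memory array; `partial_fit` only needs one chunk at a time.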