machine-learning  scikit-learn  cluster-analysis  data-mining  dbscan

Closest core sample with DBSCAN in scikit-learn


I would like to find the closest core sample for each data point. That way I could represent my data using only the core samples (i.e., reduce the dataset).

scikit-learn seems to provide only an array of all the core samples. Comparing each data point against this whole array by brute force is too expensive. Instead, I would like to get the core samples for a single cluster, look up the cluster label of a data point, and then find the closest core sample within that cluster.


Solution

  • I don't think DBSCAN is meant to be used this way (data reduction).

    More importantly, DBSCAN does not compute the nearest core point for each sample, so the fitted model does not contain the information you are looking for!

    You'll have to do it yourself.

    1. Put all core points into a spatial index (a k-d tree or a ball tree)
    2. Query the index for the nearest core point of each data point

    scikit-learn already provides everything you need, so it should take just a few lines.
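The two steps above could look roughly like the following. This is only a sketch under stated assumptions: the toy data from `make_blobs` and the `eps`/`min_samples` values are placeholders chosen for illustration, and the variable names (`nearest_core_idx`, `nearest_core_dist`) are made up for this example. It uses `KDTree`, but `BallTree` from `sklearn.neighbors` should work the same way.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import KDTree

# Toy data; eps and min_samples are illustrative assumptions, not recommendations.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

core_idx = db.core_sample_indices_  # positions of the core samples within X
core_points = db.components_        # coordinates of those core samples

# Step 1: put all core points into a spatial index.
tree = KDTree(core_points)

# Step 2: query the index for the single nearest core point of every sample.
dist, pos = tree.query(X, k=1)      # both arrays have shape (n_samples, 1)

# pos indexes into core_points; map it back to row indices of X.
nearest_core_idx = core_idx[pos[:, 0]]
nearest_core_dist = dist[:, 0]
```

A core sample is its own nearest core point (distance 0). Noise points (`db.labels_ == -1`) still get a nearest core point, just a more distant one, so you may want to treat them separately. Also note that the nearest core point of a border point does not necessarily carry the same cluster label as the border point itself; if that matters, building one tree per cluster label is a possible variation.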