Search code examples
pythonscikit-learncluster-computingdbscan

efficient filtering near/inside clusters after they found - python


essentially I applied a DBSCAN algorithm (sklearn) with an euclidean distance on a subset of my original data. I found my clusters and all is fine: except for the fact that I want to keep only values that are far enough from those on which I did not run my analysis on. I have a new distance to test such new stuff with and I wanted to understand how to do it WITHOUT numerous nested loops.

in a picture:

enter image description here

my found clusters are in blue whereas the red ones are the points to which I don't want to be near. the crosses are the points belonging to the cluster that are carved out as they are within the new distance I specified.

now, as much I could do something of the sort:

for i in red_points:
    for j in blu_points:
        if dist(i,j) < given_dist:
            original_dataframe.remove(j)

I refuse to believe there isn't a vectorized method. also, I can't afford to do as above simply because I'll have huge tables to operate upon and I'd like to avoid my CPU to evaporate away.

any and all suggestions welcome


Solution

  • Of course you can vectoriue this, but it will then still be O(n*m). Better neighbor search algorithms are not vectorized. e.g. kd-tree and ball-tree.

    Both are available in sklearn, and used by the DBSCAN module. Please see the sklearn.neighbors package.