essentially I applied a DBSCAN
algorithm (sklearn
) with an euclidean distance on a subset of my original data. I found my clusters and all is fine: except for the fact that I want to keep only values that are far enough from those on which I did not run my analysis on. I have a new distance to test such new stuff with and I wanted to understand how to do it WITHOUT numerous nested loops.
in a picture:
my found clusters are in blue whereas the red ones are the points to which I don't want to be near. the crosses are the points belonging to the cluster that are carved out as they are within the new distance I specified.
now, as much I could do something of the sort:
for i in red_points:
for j in blu_points:
if dist(i,j) < given_dist:
original_dataframe.remove(j)
I refuse to believe there isn't a vectorized method. also, I can't afford to do as above simply because I'll have huge tables to operate upon and I'd like to avoid my CPU to evaporate away.
any and all suggestions welcome
Of course you can vectoriue this, but it will then still be O(n*m). Better neighbor search algorithms are not vectorized. e.g. kd-tree and ball-tree.
Both are available in sklearn, and used by the DBSCAN module. Please see the sklearn.neighbors
package.