
DBSCAN Clustering with additional features


Can I apply DBSCAN with other features in addition to location? And if so, how can it be done through R or Spark?

I tried preparing an R table of 3 columns, one each for latitude, longitude, and score (the feature I want to cluster on in addition to the spatial features). When I run DBSCAN with the following R code, I get the plot below, which seems to tell me that the algorithm makes clusters for each pair of columns: (long, lat), (long, score), (lat, score), ...

My R code:

library(fpc)  # assumed: fpc provides dbscan() and a plot() method matching this call
df = read.table("/home/ahmedelgamal/Desktop/preparedData")
var = dbscan(df, eps = .013)  # MinPts defaults to 5
plot(x = var, data = df)

and the plot I get: [pairs plot showing the clustering over every pair of attributes]


Solution

  • You are misinterpreting the plot.

    You don't get one result per plot, but all plots show the same clusters, only in different attributes.

    But you also have the issue that the R version is (to my knowledge) only fast for Euclidean distance.

    In your current code, points are neighbors if (lat[i]-lat[j])^2 + (lon[i]-lon[j])^2 + (score[i]-score[j])^2 <= eps^2. This is bad because: (1) latitude and longitude are not Euclidean, so you should be using haversine distance instead; (2) your additional attribute has a much larger scale, so you pretty much only cluster points with nearly identical scores; and (3) your score attribute is skewed.
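    To see the scaling problem concretely, here is a small sketch (in Python, with made-up coordinates and scores) of the distance that the all-columns Euclidean formula actually computes for two points that are geographically almost on top of each other:

    ```python
    import math

    # Hypothetical data: two points ~0.001 degrees apart, with different scores.
    p1 = (40.712, -74.006, 5.0)    # (lat, lon, score)
    p2 = (40.713, -74.006, 90.0)

    # The Euclidean distance computed over all three columns at once:
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))
    print(d)  # ~85.0: dominated entirely by the score difference, so with
              # eps = .013 these points can never be neighbors, no matter
              # how close they are geographically.
    ```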

    For this problem you should probably be using Generalized DBSCAN. Points are similar if their haversine distance is less than, e.g., 1 mile (you want to measure geographic distance here, not coordinates, because of distortion) and if their score differs by a factor of at most 1.1 (i.e., compare score[y] / score[x], or work in log space). Since you want both conditions to hold, the usual Euclidean DBSCAN implementation is not enough; you need a Generalized DBSCAN that allows multiple conditions. Look for an implementation of Generalized DBSCAN (I believe there is one in ELKI that you may be able to access from Spark), or implement it yourself. It's not very hard to do.
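    The neighborhood predicate for such a Generalized DBSCAN could be sketched like this (Python; the function names, the 1-mile and 1.1 thresholds are the example values from above, and positive scores are assumed for the ratio test):

    ```python
    import math

    EARTH_RADIUS_MILES = 3958.8

    def haversine_miles(lat1, lon1, lat2, lon2):
        """Great-circle distance between two (lat, lon) points, in miles."""
        phi1, phi2 = math.radians(lat1), math.radians(lat2)
        dphi = math.radians(lat2 - lat1)
        dlam = math.radians(lon2 - lon1)
        a = (math.sin(dphi / 2) ** 2
             + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
        return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

    def is_neighbor(p, q, max_miles=1.0, max_ratio=1.1):
        """Generalized DBSCAN predicate: BOTH conditions must hold.

        p and q are (lat, lon, score) tuples; scores are assumed positive.
        """
        close = haversine_miles(p[0], p[1], q[0], q[1]) <= max_miles
        lo, hi = sorted([p[2], q[2]])
        similar = hi <= max_ratio * lo  # scores differ by a factor of at most 1.1
        return close and similar
    ```

    A Generalized DBSCAN then uses this predicate (plus the usual minPts core-point condition) in place of the Euclidean eps-range query.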

    If quadratic runtime is okay for you, you can probably use any distance-matrix-based DBSCAN, and simply "hack" a binary distance matrix:

    1. Compute haversine distances.
    2. Compute score dissimilarities.
    3. Set distance = 0 if haversine < distance-threshold and score-dissimilarity < score-threshold, otherwise 1.
    4. Run DBSCAN with the precomputed distance matrix and eps = 0.5 (since it is a binary matrix, don't change eps!).
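    The four steps above could be sketched as follows, assuming Python is an option: the binary matrix is built with numpy, and scikit-learn's DBSCAN accepts it via metric="precomputed" (the thresholds and the tiny data set are placeholders):

    ```python
    import numpy as np
    from sklearn.cluster import DBSCAN

    def binary_distance_matrix(lat, lon, score, max_miles=1.0, max_ratio=1.1):
        """Steps 1-3: haversine distances, score dissimilarity, binarize."""
        R = 3958.8  # Earth radius in miles
        phi = np.radians(lat)
        dphi = phi[:, None] - phi[None, :]
        dlam = np.radians(lon)[:, None] - np.radians(lon)[None, :]
        a = (np.sin(dphi / 2) ** 2
             + np.cos(phi)[:, None] * np.cos(phi)[None, :] * np.sin(dlam / 2) ** 2)
        geo = 2 * R * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))       # step 1
        ratio = (np.maximum(score[:, None], score[None, :])
                 / np.minimum(score[:, None], score[None, :]))       # step 2 (positive scores assumed)
        return np.where((geo < max_miles) & (ratio < max_ratio), 0.0, 1.0)  # step 3

    # Toy data: two nearby points with similar scores, one far-away point.
    lat = np.array([40.7128, 40.7130, 41.5000])
    lon = np.array([-74.0060, -74.0058, -73.9000])
    score = np.array([10.0, 10.5, 10.0])

    D = binary_distance_matrix(lat, lon, score)
    # Step 4: eps = 0.5 on the binary matrix (don't change eps!).
    labels = DBSCAN(eps=0.5, min_samples=2, metric="precomputed").fit_predict(D)
    print(labels)  # the two nearby points share a cluster; the far one is noise (-1)
    ```

    In R, the same matrix could be handed to fpc::dbscan with method = "dist"; the point is only that the binary matrix makes any distance-matrix-based DBSCAN apply the two-threshold predicate.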

    It's reasonably fast, but needs O(n^2) memory. In my experience, the indexes of ELKI yield a good speedup if you have larger data, and are worth a try if you run out of memory or time.