Search code examples
ralgorithmdata-sciencecluster-analysisdbscan

DBSCAN: variable cluster size


Reposting because i didn't add in my data earlier;

I have been running a DBSCAN algorithm on R for a project where I'm creating clusters of individuals based on the location they are in.

I need clusters of 3 people (k=3), each matched on the location of each individual and my eps value is 1.2

The problem I'm facing with my algorithm is that the cohorts/clusters are of variable size.

This is my output after running the clustering code, and as you can see, there are 5 variables in cluster #2, and I want to split this up into 3 + 2 (so, cluster number 3 will have 3 points and cluster number 4 will have 2 points)

DBSCAN clustering for 10 objects.
Parameters: eps = 1.2, minPts = 3
The clustering contains 2 cluster(s) and 2 noise points.

0 1 2 
2 3 5 

Is there any viable way for me to split up each of these large clusters?

Sample data is here;


              id        long         lat
 [1,] -2.10229619  1.01270296  0.50710753
 [2,] -0.96591987  1.28347407  0.30814912
 [3,] -0.39773171  0.71560717  1.09179544
 [4,] -0.20833566  1.05423809 -1.21860434
 [5,] -0.01893961 -0.27834265 -1.85409180
 [6,]  0.17045645 -0.08790463 -0.69261331
 [7,]  0.54924855 -0.80977566  0.33080001
 [8,]  0.73864461 -0.54869120 -0.08154845
 [9,]  0.92804066 -0.49214358  1.37150115
[10,]  1.30683277 -1.84916458  0.23750465


Solution

  • I'd use integer programming for this.

    Set a distance limit. Enumerate all pairs where the distance is under the limit. Extend these to all triples where each of the three pairwise distances is under the limit.

    Formulate an integer program with a 0-1 variable for each triple and each pair. The score for a pair/triple is the sum of pairwise distances. The objective is to minimize the sum of the scores of the triples and pairs chosen. For each point, we constrain the sum of triples and pairs that contain it to equal one. We also constrain the number of pairs to be at most two.

    For pairs {1, 2}, {1, 3}, {2, 3}, {2, 4}, {3, 4}, {4, 5} and triples {1, 2, 3}, {2, 3, 4}, the program looks like this:

    minimize score({1, 2, 3}) x_{1,2,3} + score({2, 3, 4}) x_{2,3,4} +
    score({1,2}) x_{1,2} + score({1,3}) x_{1,3} + ... + score({4,5}) x_{4,5}
    
    subject to
    x_{1,2} + x_{1,3} + x_{1,2,3} = 1
    x_{1,2} + x_{2,3} + x_{2,4} + x_{1,2,3} + x_{2,3,4} = 1
    x_{1,3} + x_{2,3} + x_{3,4} + x_{1,2,3} + x_{2,3,4} = 1
    x_{3,4} + x_{4,5} + x_{2,3,4} = 1
    x_{4,5} = 1
    x_{1,2} + x_{1,3} + ... + x_{4,5} <= 2
    
    binary variables
    x_{1,2}, x_{1,3}, ..., x_{4,5}, x_{1,2,3}, x_{2,3,4} in {0, 1}
    

    The Rglpk package provides access to a suitable integer program solver.