
Strange distance for the k-th neighbor in the nearest neighbor graph


To clarify, I have data sets with many dimensions (hundreds to thousands) that may need to be normalized.

I would like to calculate the distance to the k-th neighbor in the nearest neighbor graph. For this data set, I calculated the average distance to the k-th nearest neighbor, but the result was strangely large. For example, with k = 5 the average distance was 2147266047, and when k increased to 12 the average grew to 4161197373! I am sure something is wrong, but I don't know what exactly: maybe it is the Euclidean distance being used, or maybe I need to normalize the data before calculating the distance.

What confuses me more is that the method worked perfectly when applied to another data set such as iris. My code is below:

library(spatstat)   # for ppx() and nndist()
library(magrittr)   # for the %>% pipe

data(iris)
iris <- as.matrix(iris[, 1:4])
distance <- ppx(iris) %>% nndist(k = 3)  # distance to each point's 3rd nearest neighbor
avg <- mean(distance)                    # same as sum(distance) / length(distance)
avg

My first question: is it normal to get large values like these for epsilon, or is there something wrong in how I am processing the data?

My second question: are there other methods to estimate the value of epsilon?


Solution

  • I think that you have, to a large extent, answered your own question.

    First, I believe that you calculated correctly. Here is my code to compute the same things.

    library(dbscan)
    summary(kNNdist(as.matrix(LSVT), 5))
           1                   2                   3                   4                   5            
     Min.   :2.326e+07   Min.   :5.656e+07   Min.   :9.132e+07   Min.   :1.316e+08   Min.   :1.981e+08  
     1st Qu.:1.104e+08   1st Qu.:2.178e+08   1st Qu.:3.041e+08   1st Qu.:3.811e+08   1st Qu.:5.201e+08  
     Median :2.231e+08   Median :3.783e+08   Median :4.964e+08   Median :6.183e+08   Median :7.723e+08  
     Mean   :7.414e+08   Mean   :1.195e+09   Mean   :1.557e+09   Mean   :1.849e+09   Mean   :2.147e+09  
     3rd Qu.:4.633e+08   3rd Qu.:9.285e+08   3rd Qu.:1.189e+09   3rd Qu.:1.391e+09   3rd Qu.:1.533e+09  
     Max.   :1.861e+10   Max.   :3.379e+10   Max.   :3.512e+10   Max.   :3.795e+10   Max.   :4.600e+10  
    

    Notice that the mean for the 5th nearest neighbor is 2.147e+09, which is what you got.
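    For reference, the single number you computed for k = 5 is just the mean of that 5th column. Here is a minimal sketch of that check (note: in newer versions of dbscan, kNNdist() returns only the k-th distances unless all = TRUE, so the flag is included here):

    d5 <- kNNdist(as.matrix(LSVT), k = 5, all = TRUE)  # matrix with columns 1..k
    mean(d5[, 5])   # ~2.147e+09, the average reported in the question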

    Should that value be surprising? No. Some of your individual dimensions contain huge variations. For example, using only dimension 189:

    max(LSVT[,189]) - min(LSVT[,189])
    [1] 80398191552
    
    summary(kNNdist(as.matrix(LSVT[,189]), 5))
           1                   2                   3                   4                   5            
     Min.   :4.098e+04   Min.   :3.259e+07   Min.   :4.034e+07   Min.   :5.791e+07   Min.   :7.772e+07  
     1st Qu.:3.163e+07   1st Qu.:1.016e+08   1st Qu.:1.657e+08   1st Qu.:2.309e+08   1st Qu.:2.909e+08  
     Median :7.078e+07   Median :1.877e+08   Median :2.502e+08   Median :3.561e+08   Median :4.610e+08  
     Mean   :3.580e+08   Mean   :8.389e+08   Mean   :1.112e+09   Mean   :1.345e+09   Mean   :1.623e+09  
     3rd Qu.:1.928e+08   3rd Qu.:5.211e+08   3rd Qu.:6.996e+08   3rd Qu.:9.491e+08   3rd Qu.:1.008e+09  
     Max.   :1.036e+10   Max.   :2.787e+10   Max.   :2.888e+10   Max.   :3.126e+10   Max.   :3.770e+10
    

    These dimensions on a very large scale will completely overwhelm the dimensions on a small scale. Because of this, you should almost certainly normalize the data.

    summary(kNNdist(scale(as.matrix(LSVT)), 5))
           1                2                3                4                5         
     Min.   : 7.002   Min.   : 7.511   Min.   : 7.742   Min.   : 7.949   Min.   : 8.047  
     1st Qu.: 8.701   1st Qu.: 9.261   1st Qu.: 9.501   1st Qu.: 9.664   1st Qu.: 9.851  
     Median :10.010   Median :10.425   Median :10.626   Median :10.890   Median :11.172  
     Mean   :11.456   Mean   :12.417   Mean   :12.927   Mean   :13.306   Mean   :13.551  
     3rd Qu.:11.622   3rd Qu.:12.176   3rd Qu.:12.492   3rd Qu.:12.876   3rd Qu.:13.093  
     Max.   :70.220   Max.   :76.359   Max.   :83.243   Max.   :87.601   Max.   :88.197  
    

    Why is this different from the iris data? There are two big differences between your data and the iris data. First, your data contains attributes on vastly different scales, whereas all of the iris attributes are comparably sized. Second, the values for the iris data are all within an order of magnitude of 1, while your data has values that are both much smaller and much larger.

    summary(LSVT[,c(27,189)])
     Jitter..pitch_TKEO_prc75 entropy_shannon2_10_coef
     Min.   :-4.799e-09       Min.   :-8.233e+10      
     1st Qu.:-1.582e-11       1st Qu.:-1.831e+10      
     Median : 1.987e-11       Median :-1.090e+10      
     Mean   : 3.901e-10       Mean   :-1.576e+10      
     3rd Qu.: 1.164e-10       3rd Qu.:-6.748e+09      
     Max.   : 9.440e-09       Max.   :-1.934e+09 
    
    
    summary(iris[,1:4])
      Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
     Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
     1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
     Median :5.800   Median :3.000   Median :4.350   Median :1.300  
     Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
     3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
     Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
    
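    Regarding your second question, estimating epsilon: if epsilon here means DBSCAN's eps parameter, one common heuristic is to plot the sorted k-NN distances and look for a knee. Below is a minimal sketch using kNNdistplot() from the same dbscan package; the eps line drawn is a hypothetical placeholder to be adjusted by eye, not a value derived from this data.

    # Sorted k-NN distance plot on the standardized data; the "knee" of
    # the curve is a common heuristic for choosing DBSCAN's eps.
    kNNdistplot(scale(as.matrix(LSVT)), k = 5)
    abline(h = 15, lty = 2)  # hypothetical eps near the knee; adjust by eye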

    Response to comment

    Using the R scale function is what I would call standardization. There are other ways to scale the data, and I do not mean to imply that standardization is the best. My intent with this answer was only to point out why you were seeing that behavior and to point in the direction of how to address it. Your data has variables on vastly different scales, and you are computing distances; that will give the variables on a small scale almost no influence on the result, which is probably not what you want. Standardization is a natural first attempt at addressing that. You can probably use it to get a better distance metric and, hopefully, a better understanding of how your variables interact, but other or additional transformations of your data may be needed.
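    As one illustration of "other ways to scale the data", here is a minimal sketch of min-max scaling, which maps each column to [0, 1] instead of standardizing to mean 0 and sd 1. This is only an example of an alternative, not a recommendation for this particular data set.

    # Min-max scaling: rescale each column of LSVT to [0, 1].
    # (Columns with zero range would need special handling.)
    minmax <- apply(as.matrix(LSVT), 2,
                    function(x) (x - min(x)) / (max(x) - min(x)))

    # Recompute the k-NN distances on the rescaled data (dbscan package).
    summary(kNNdist(minmax, k = 5))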