
Strange distance for the k-th neighbor in the nearest neighbor graph


To clarify, I have data sets with many dimensions (hundreds to thousands) that may need to be normalized.

I would like to calculate the distance to the k-th neighbor in the nearest neighbor graph. For this data set, I calculated the average distance to the k-th nearest neighbor, but the result was strangely large. For example, with k = 5 the average distance was 2147266047, and when k increased to 12 the average grew to 4161197373! I am sure something is wrong, but I don't know what exactly: maybe it is the Euclidean distance being used, or maybe I need to normalize the data before calculating the distance.

What confuses me more is that the method worked perfectly when applied to another data set such as iris. My code is below:

library(spatstat)   # for ppx() and nndist()
library(magrittr)   # for the %>% pipe

data(iris)
iris <- as.matrix(iris[, 1:4])
distance <- ppx(iris) %>% nndist(k = 3)  # distance to each point's 3rd nearest neighbor
avg <- mean(distance)                    # same as sum(distance) / length(distance)
avg

My first question: is it normal to get large values like these for epsilon, or is there something wrong in how I am processing the data?

My second question: are there other methods to estimate the value of epsilon?


Solution

  • I think that you have, to a large extent, answered your own question.

    First, I believe that you calculated correctly. Here is my code to compute the same things.

    library(dbscan)
    summary(kNNdist(as.matrix(LSVT), 5))
           1                   2                   3                   4                   5            
     Min.   :2.326e+07   Min.   :5.656e+07   Min.   :9.132e+07   Min.   :1.316e+08   Min.   :1.981e+08  
     1st Qu.:1.104e+08   1st Qu.:2.178e+08   1st Qu.:3.041e+08   1st Qu.:3.811e+08   1st Qu.:5.201e+08  
     Median :2.231e+08   Median :3.783e+08   Median :4.964e+08   Median :6.183e+08   Median :7.723e+08  
     Mean   :7.414e+08   Mean   :1.195e+09   Mean   :1.557e+09   Mean   :1.849e+09   Mean   :2.147e+09  
     3rd Qu.:4.633e+08   3rd Qu.:9.285e+08   3rd Qu.:1.189e+09   3rd Qu.:1.391e+09   3rd Qu.:1.533e+09  
     Max.   :1.861e+10   Max.   :3.379e+10   Max.   :3.512e+10   Max.   :3.795e+10   Max.   :4.600e+10  
    

    Notice that the mean for the 5th nearest neighbor is 2.147e+09, which is what you got.
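    For reference, the single number you computed for k = 5 is just the mean of that 5th column. Here is a minimal sketch of that check (note: in newer versions of dbscan, kNNdist() returns only the k-th distances unless all = TRUE, so the flag is included here):

    d5 <- kNNdist(as.matrix(LSVT), k = 5, all = TRUE)  # matrix with columns 1..k
    mean(d5[, 5])   # ~2.147e+09, the average reported in the question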

    Should that value be surprising? No. Some of your individual dimensions contain huge variations. For example, using only dimension 189:

    max(LSVT[,189]) - min(LSVT[,189])
    [1] 80398191552
    
    summary(kNNdist(as.matrix(LSVT[,189]), 5))
           1                   2                   3                   4                   5            
     Min.   :4.098e+04   Min.   :3.259e+07   Min.   :4.034e+07   Min.   :5.791e+07   Min.   :7.772e+07  
     1st Qu.:3.163e+07   1st Qu.:1.016e+08   1st Qu.:1.657e+08   1st Qu.:2.309e+08   1st Qu.:2.909e+08  
     Median :7.078e+07   Median :1.877e+08   Median :2.502e+08   Median :3.561e+08   Median :4.610e+08  
     Mean   :3.580e+08   Mean   :8.389e+08   Mean   :1.112e+09   Mean   :1.345e+09   Mean   :1.623e+09  
     3rd Qu.:1.928e+08   3rd Qu.:5.211e+08   3rd Qu.:6.996e+08   3rd Qu.:9.491e+08   3rd Qu.:1.008e+09  
     Max.   :1.036e+10   Max.   :2.787e+10   Max.   :2.888e+10   Max.   :3.126e+10   Max.   :3.770e+10
    

    These dimensions on a very large scale will completely overwhelm the dimensions on a small scale. Because of this, you should almost certainly normalize the data.

    summary(kNNdist(scale(as.matrix(LSVT)), 5))
           1                2                3                4                5         
     Min.   : 7.002   Min.   : 7.511   Min.   : 7.742   Min.   : 7.949   Min.   : 8.047  
     1st Qu.: 8.701   1st Qu.: 9.261   1st Qu.: 9.501   1st Qu.: 9.664   1st Qu.: 9.851  
     Median :10.010   Median :10.425   Median :10.626   Median :10.890   Median :11.172  
     Mean   :11.456   Mean   :12.417   Mean   :12.927   Mean   :13.306   Mean   :13.551  
     3rd Qu.:11.622   3rd Qu.:12.176   3rd Qu.:12.492   3rd Qu.:12.876   3rd Qu.:13.093  
     Max.   :70.220   Max.   :76.359   Max.   :83.243   Max.   :87.601   Max.   :88.197  
    

    Why is this different from the iris data? There are two big differences between your data and the iris data. First, your data contains attributes on vastly different scales, whereas all of the iris attributes are comparably sized. Second, the values for the iris data are all within an order of magnitude of 1, while your data has values that are both much smaller and much larger.

    summary(LSVT[,c(27,189)])
     Jitter..pitch_TKEO_prc75 entropy_shannon2_10_coef
     Min.   :-4.799e-09       Min.   :-8.233e+10      
     1st Qu.:-1.582e-11       1st Qu.:-1.831e+10      
     Median : 1.987e-11       Median :-1.090e+10      
     Mean   : 3.901e-10       Mean   :-1.576e+10      
     3rd Qu.: 1.164e-10       3rd Qu.:-6.748e+09      
     Max.   : 9.440e-09       Max.   :-1.934e+09 
    
    
    summary(iris[,1:4])
      Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
     Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
     1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
     Median :5.800   Median :3.000   Median :4.350   Median :1.300  
     Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
     3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
     Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
    
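    Regarding your second question, estimating epsilon: if epsilon here means DBSCAN's eps parameter, one common heuristic is to plot the sorted k-NN distances and look for a knee. Below is a minimal sketch using kNNdistplot() from the same dbscan package; the eps line drawn is a hypothetical placeholder to be adjusted by eye, not a value derived from this data.

    # Sorted k-NN distance plot on the standardized data; the "knee" of
    # the curve is a common heuristic for choosing DBSCAN's eps.
    kNNdistplot(scale(as.matrix(LSVT)), k = 5)
    abline(h = 15, lty = 2)  # hypothetical eps near the knee; adjust by eye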

    Response to comment

    Using the R scale function is what I would call standardization. There are other ways to scale the data, and I do not mean to imply that standardization is the best. My intent with this answer was only to point out why you were seeing that behavior and to point in the direction of how to address it. Your data has variables on vastly different scales, and you are computing distances; that will give the variables on a small scale almost no influence on the result, which is probably not what you want. Standardization is a natural first attempt at addressing that. You can probably use it to get a better distance metric and, hopefully, a better understanding of how your variables interact, but other or additional transformations of your data may be needed.
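    As one illustration of "other ways to scale the data", here is a minimal sketch of min-max scaling, which maps each column to [0, 1] instead of standardizing to mean 0 and sd 1. This is only an example of an alternative, not a recommendation for this particular data set.

    # Min-max scaling: rescale each column of LSVT to [0, 1].
    # (Columns with zero range would need special handling.)
    minmax <- apply(as.matrix(LSVT), 2,
                    function(x) (x - min(x)) / (max(x) - min(x)))

    # Recompute the k-NN distances on the rescaled data (dbscan package).
    summary(kNNdist(minmax, k = 5))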