To clarify, I have data sets with many dimensions (hundreds to thousands) that may need to be normalized.
I would like to calculate the distance to the k-th neighbor in the nearest neighbor graph. For this data set I calculated the average distance to the k-th nearest neighbor, but the result was far too large and looked strange. For example, with k = 5 the resulting average distance was 2147266047, and when k increased to 12 the average increased to 4161197373. I am sure something is wrong, but I don't know what exactly; maybe it is because of the Euclidean distance used, or maybe I need to normalize the data before calculating the distances.
What confuses me more is that the method worked perfectly when I applied it to another data set such as iris. My code is below:
library(spatstat)   # for ppx() and nndist()
library(magrittr)   # for the %>% pipe
data(iris)
iris <- as.matrix(iris[, 1:4])
distance <- ppx(iris) %>% nndist(k = 3)      # distance from each point to its 3rd nearest neighbor
as.vector(distance)
avg <- sum(distance) / length(distance)      # average 3rd-nearest-neighbor distance
avg
My first question: is it normal to get values this large for epsilon, or is there something wrong in how I am processing the data?
My other question: are there other methods to estimate the value of epsilon?
I think that you have, to a large extent, answered your own question.
First, I believe that you calculated correctly. Here is my code to compute the same thing.
library(dbscan)
summary(kNNdist(as.matrix(LSVT), 5))   # distances from each row to its 1st through 5th nearest neighbors
1 2 3 4 5
Min. :2.326e+07 Min. :5.656e+07 Min. :9.132e+07 Min. :1.316e+08 Min. :1.981e+08
1st Qu.:1.104e+08 1st Qu.:2.178e+08 1st Qu.:3.041e+08 1st Qu.:3.811e+08 1st Qu.:5.201e+08
Median :2.231e+08 Median :3.783e+08 Median :4.964e+08 Median :6.183e+08 Median :7.723e+08
Mean :7.414e+08 Mean :1.195e+09 Mean :1.557e+09 Mean :1.849e+09 Mean :2.147e+09
3rd Qu.:4.633e+08 3rd Qu.:9.285e+08 3rd Qu.:1.189e+09 3rd Qu.:1.391e+09 3rd Qu.:1.533e+09
Max. :1.861e+10 Max. :3.379e+10 Max. :3.512e+10 Max. :3.795e+10 Max. :4.600e+10
Notice that the mean for the 5th nearest neighbor is 2.147e+09, which is what you got.
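(A side note: in newer versions of the dbscan package, kNNdist() returns only the k-th nearest-neighbor distances by default; I believe you need all = TRUE to reproduce the full matrix above, in which case the average you computed is just the mean of the last column. A hedged sketch:)
d5 <- kNNdist(as.matrix(LSVT), 5, all = TRUE)   # matrix of 1st..5th nearest-neighbor distances (newer dbscan)
mean(d5[, 5])                                   # should reproduce the ~2.147e+09 average you computed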
Should that value be surprising? No. Some of your individual dimensions contain huge variation. For example, using only dimension 189:
max(LSVT[,189]) - min(LSVT[,189])   # range of dimension 189 alone
[1] 80398191552
summary(kNNdist(as.matrix(LSVT[,189]), 5))   # kNN distances using only dimension 189
1 2 3 4 5
Min. :4.098e+04 Min. :3.259e+07 Min. :4.034e+07 Min. :5.791e+07 Min. :7.772e+07
1st Qu.:3.163e+07 1st Qu.:1.016e+08 1st Qu.:1.657e+08 1st Qu.:2.309e+08 1st Qu.:2.909e+08
Median :7.078e+07 Median :1.877e+08 Median :2.502e+08 Median :3.561e+08 Median :4.610e+08
Mean :3.580e+08 Mean :8.389e+08 Mean :1.112e+09 Mean :1.345e+09 Mean :1.623e+09
3rd Qu.:1.928e+08 3rd Qu.:5.211e+08 3rd Qu.:6.996e+08 3rd Qu.:9.491e+08 3rd Qu.:1.008e+09
Max. :1.036e+10 Max. :2.787e+10 Max. :2.888e+10 Max. :3.126e+10 Max. :3.770e+10
These dimensions on a very large scale will completely overwhelm the dimensions on a small scale. Because of this, you should almost certainly normalize the data.
summary(kNNdist(scale(as.matrix(LSVT)), 5))   # standardize each column, then recompute the kNN distances
1 2 3 4 5
Min. : 7.002 Min. : 7.511 Min. : 7.742 Min. : 7.949 Min. : 8.047
1st Qu.: 8.701 1st Qu.: 9.261 1st Qu.: 9.501 1st Qu.: 9.664 1st Qu.: 9.851
Median :10.010 Median :10.425 Median :10.626 Median :10.890 Median :11.172
Mean :11.456 Mean :12.417 Mean :12.927 Mean :13.306 Mean :13.551
3rd Qu.:11.622 3rd Qu.:12.176 3rd Qu.:12.492 3rd Qu.:12.876 3rd Qu.:13.093
Max. :70.220 Max. :76.359 Max. :83.243 Max. :87.601 Max. :88.197
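For clarity, scale() here is ordinary z-score standardization: each column is centered on its mean and divided by its standard deviation. A minimal sketch of the same thing done by hand (assuming LSVT is numeric with no missing values):
x <- as.matrix(LSVT)
x_std <- sweep(x, 2, colMeans(x), "-")                   # subtract each column's mean
x_std <- sweep(x_std, 2, apply(x, 2, sd), "/")           # divide by each column's standard deviation
all.equal(x_std, scale(x), check.attributes = FALSE)     # should be TRUE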
Why is this different from the iris data? There are two big differences between your data and the iris data. First, your data contains attributes on vastly different scales, whereas all of the iris attributes are comparably sized. Second, the values in the iris data are all within an order of magnitude of 1, while your data has values that are both much smaller and much larger.
summary(LSVT[,c(27,189)])
Jitter..pitch_TKEO_prc75 entropy_shannon2_10_coef
Min. :-4.799e-09 Min. :-8.233e+10
1st Qu.:-1.582e-11 1st Qu.:-1.831e+10
Median : 1.987e-11 Median :-1.090e+10
Mean : 3.901e-10 Mean :-1.576e+10
3rd Qu.: 1.164e-10 3rd Qu.:-6.748e+09
Max. : 9.440e-09 Max. :-1.934e+09
summary(iris[,1:4])
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
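To make the "overwhelming" concrete with a back-of-the-envelope check: differences in column 189 are on the order of 1e10, while differences in column 27 are on the order of 1e-9, so the small column's squared contribution to a Euclidean distance is roughly 38 orders of magnitude below the large column's and vanishes entirely in double precision:
sqrt((1e10)^2 + (1e-9)^2) == 1e10   # TRUE: the small-scale column cannot move the distance at all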
Response to comment
Using the R scale function is what I would call standardization. There are other ways to scale the data, and I do not mean to imply that standardization is the best. My intent with this answer was only to point out why you were seeing the behavior that you were seeing and to point in the direction of how to address it. Your data has variables on vastly different scales, and you are computing distances; that makes the variables on a small scale have almost no influence on the result, which is probably not what you want.
Standardization is a natural first attempt at addressing that. You can probably use it to get a better distance metric and, hopefully, a better understanding of how your variables interact. But other or additional transformations to your data may be needed.
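As an illustration of "other ways to scale" (only a sketch, not a recommendation), two common alternatives are min-max rescaling to [0, 1] and robust scaling by the median and MAD; this assumes no column of LSVT has zero range or zero MAD:
x <- as.matrix(LSVT)
rng   <- apply(x, 2, range)                                               # per-column min (row 1) and max (row 2)
x_mm  <- sweep(sweep(x, 2, rng[1, ], "-"), 2, rng[2, ] - rng[1, ], "/")   # min-max rescaling to [0, 1]
x_rob <- sweep(sweep(x, 2, apply(x, 2, median), "-"),
               2, apply(x, 2, mad), "/")                                  # (x - median) / MAD, less sensitive to outliers
summary(kNNdist(x_mm, 5))                                                 # compare the resulting kNN distances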