r testing vector histogram similarity

Comparing distribution of two vectors


I have 5 different vectors and one reference vector that I want to compare them to. I need to find which of the 5 vectors is most similar to the reference.

The vectors are quite long, so I will just show a little of it:

# Vector to compare to:
v1 <- c(0.2500, 0.4375, 0.1250, 0.3125, 0.0000, 0.5625, 0.1250, 0.1875, 0.1875, 0.1875, 0.1875)

# One of the vectors to compare
v2 <- c(2, 1, 0, 1, 1, 1, 1, 0, 2, 1, 2)

# Another of the vectors to compare
v3 <- c(5, 0, 3, 1, 1, 2, 1, 2, 0, 1, 2)

In practice, I need a statistical test that compares the distributions (histograms) of these vectors and tells me which one is closest to the reference. I tried ks.test, but it had a problem with duplicate values in the vectors, and the p-value it returned was something like 0.0000000000001. Any ideas on how to do this (other than visually)?


Solution

  • It's not clear to me why you need a statistical test if all you want to do is compute which one is closest. Below I'm just computing the histograms directly and comparing their distances.

    Generate data:

    v1 <- c(0.2500, 0.4375, 0.1250, 0.3125, 0.0000, 0.5625, 0.1250,
       0.1875, 0.1875, 0.1875, 0.1875)
    v2 <- c(2, 1, 0, 1, 1, 1, 1, 0, 2, 1, 2)*0.1
    v3 <- c(5, 0, 3, 1, 1, 2, 1, 2, 0, 1, 2)*0.1
    

    Note that I rescaled vectors 2 and 3 (multiplying them by 0.1) so that their distributions actually overlap with the comparison vector. The histogram counts are then computed on a common set of breaks:

    vList <- list(v1,v2,v3)
    brkvec <- seq(0,0.7,by=0.1)
    hList <- lapply(vList,function(x)
         hist(x,plot=FALSE, breaks=brkvec)$counts )
    

    Next, compute the Euclidean distances between the count vectors. This is a little bit inefficient because dist() computes all of the pairwise distances and we then keep only the distances to v1 (the first column):

    dmat <- dist(do.call(rbind,hList))
    dvec <- as.matrix(dmat)[-1,1]
    dvec
    ##        2        3 
    ## 7.874008 6.000000 
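
    The wasted pairwise work mentioned above can be avoided by measuring only the distance from the reference histogram to each candidate. The following is a minimal sketch, assuming the same rescaled data and breaks as above; `bin_counts`, `h_ref`, `candidates` and `dvec_direct` are illustrative names that do not appear in the original answer.

    ```r
    # Data as defined in the answer (v2 and v3 rescaled by 0.1)
    v1 <- c(0.2500, 0.4375, 0.1250, 0.3125, 0.0000, 0.5625, 0.1250,
            0.1875, 0.1875, 0.1875, 0.1875)
    v2 <- c(2, 1, 0, 1, 1, 1, 1, 0, 2, 1, 2) * 0.1
    v3 <- c(5, 0, 3, 1, 1, 2, 1, 2, 0, 1, 2) * 0.1
    brkvec <- seq(0, 0.7, by = 0.1)

    # Hypothetical helper: bin counts on the shared breaks
    bin_counts <- function(x) hist(x, plot = FALSE, breaks = brkvec)$counts

    h_ref <- bin_counts(v1)
    candidates <- list(v2 = v2, v3 = v3)

    # Euclidean distance from the reference histogram to each candidate only
    dvec_direct <- sapply(candidates,
                          function(x) sqrt(sum((h_ref - bin_counts(x))^2)))
    dvec_direct
    ##       v2       v3 
    ## 7.874008 6.000000 
    ```

    For three vectors the saving is negligible, but with many candidates this grows linearly rather than quadratically in the number of vectors.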
    

    The other option would be to ignore the ties warning from ks.test(), since ties affect only the inference (the p-value), not the computation of the distance statistic:

    ks.dist <- sapply(vList[-1],
            function(x) suppressWarnings(ks.test(v1,x)$statistic))
    ks.dist
    ##         D         D 
    ## 0.6363636 0.4545455
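
    To see why the warning can be ignored for this purpose, the statistic can be recomputed by hand as the largest vertical gap between the two empirical CDFs. This is only a sketch under the same rescaled data as above; `ks_stat`, `manual` and `via_test` are hypothetical names introduced for illustration.

    ```r
    v1 <- c(0.2500, 0.4375, 0.1250, 0.3125, 0.0000, 0.5625, 0.1250,
            0.1875, 0.1875, 0.1875, 0.1875)
    v2 <- c(2, 1, 0, 1, 1, 1, 1, 0, 2, 1, 2) * 0.1
    v3 <- c(5, 0, 3, 1, 1, 2, 1, 2, 0, 1, 2) * 0.1

    # Hypothetical helper: maximum absolute difference between the two
    # empirical CDFs, evaluated at every observed value of either sample
    ks_stat <- function(x, y) {
        pts <- sort(unique(c(x, y)))
        max(abs(ecdf(x)(pts) - ecdf(y)(pts)))
    }

    manual <- c(v2 = ks_stat(v1, v2), v3 = ks_stat(v1, v3))

    # Same statistic as reported by ks.test(); warnings about ties suppressed
    via_test <- sapply(list(v2 = v2, v3 = v3),
                       function(x) unname(suppressWarnings(ks.test(v1, x)$statistic)))

    all.equal(manual, via_test)
    ## [1] TRUE
    ```

    Ties only mean that several observations share a jump point of the ECDF, so the gap itself is still well defined.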
    

    The two approaches agree (i.e., v3 is closer to v1 than v2 is).
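
    With five candidates rather than two, the closest one can be picked programmatically instead of by reading the output. A small sketch, assuming the distances are stored in named vectors; the values below are the ones reported above, and the names `hist.dist`, `closest_hist` and `closest_ks` are illustrative.

    ```r
    # Distances reported above, labelled by candidate vector
    hist.dist <- c(v2 = 7.874008, v3 = 6.000000)
    ks.dist   <- c(v2 = 0.6363636, v3 = 0.4545455)

    # which.min() returns the position (and name) of the smallest distance
    closest_hist <- names(which.min(hist.dist))
    closest_ks   <- names(which.min(ks.dist))

    closest_hist
    ## [1] "v3"
    identical(closest_hist, closest_ks)
    ## [1] TRUE
    ```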