Search code examples
rmatrixearth-movers-distance

Computing EMD with distributions of different sizes using emdist (R)


I am trying to use Earthmover's distance in R via the emdist package to find the distance between two distributions with unlike numbers of observations. My understanding is that EMD can be used for different size distributions, and indeed this is the purpose to find the distance when the distributions have different parameters.

However, I have tried to run multiple codes, and in all cases I receive the error:

Error in emd2d(sample, sample2, dist = "manhattan") : 
  A and B must be matrices of the same dimensions

My codes is the following:

library(emdist)

sample = structure(c(6, 4, 5, 6, 4, 0, 5, 6, 3, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1), dim = c(10L, 2L), dimnames = list(NULL, c("variable", 
"weight")))

sample2 = structure(c(0.25, 0.75, 0, 0.75, 0.75, 0.75, 0, 0, 0.75, 0.25, 
0, 0.25, 0.75, 0.75, 0.25, 0.25, 0, 0.75, 0, 0.75, 0.75, 0.75, 
0.25, 0.75, 1, 0.25, 0.75, 0.25, 0.25, 0.25, 0.25, 0.75, 0, 0.75, 
0.75, 1, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75, 0.25, 0.25, 0.25, 
0, 1, 0.25, 0.25, 0.75, 0.75, 0.25, 0.25, 1, 0, 0, 0.25, 0.25, 
0.25, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), dim = c(59L, 
2L), dimnames = list(NULL, c("variable", "weight")))

emd2d(sample,sample2)

which produces the error:

Error in emd2d(sample, sample2, dist = "manhattan") : 
  A and B must be matrices of the same dimensions

I tried transposing the matrices so that they match the example from the package vignette:

emd2d(t(sample),t(sample2))

but of course this does not make the dimensions the same, so it doesn't change anything.

I also tried using the emd and emdr functions instead, but this consistently returns 0 as the distance, which I don't think can be correct, as I have changed the values multiple times but do not get a different output:

emd(sample)
emdr(sample2)

Am I programming the EMD wrongly? It is difficult to tell from the package manual how to apply it to distributions with different observation numbers. Can this be done with the emdist package? Am I misunderstanding what the matrices' structures represent?

If helpful, the weights included in the sample code are not important; I do not need them, my understanding was that these must be specified in the EMD syntax, so I included them all as the same. I only intend to calculate EMD between sample's "variable" and sample2's "variable".


Solution

  • As I understand, your distributions are uniformly distributed (weight=1) on the variable column.

    You can use the kantorovich package, up to a problem... I am the author and I've just realized my code throws an error when the two distributions do not have the same lengths, which is stupid... So for the meantime (I will fix that), you can copy the code of kantorovich_CVXR and comment the lines throwing this error:

    library(CVXR)
    
    kantorovich_CVX <- function(
        mu, nu, dist=NULL, solution=FALSE, stop_if_fail=TRUE, solver = "ECOS", ...
    ){
      m <- length(mu)
      n <- length(nu)
      # checks
      # if(m != n) stop("mu and nu do not have the same length")
      # if(!is.null(dist)){
      #   if(!is(dist, "matrix") || mode(dist) != "numeric")
      #     stop("dist must be a numeric matrix")
      #   if(nrow(dist)!=m || ncol(dist)!=m)
      #     stop("invalid dimensions of the dist matrix")
      # }
      if(sum(mu)!=1 || sum(nu)!=1 || any(mu<0) || any(nu<0)){
        message("Warning: mu and/or nu are not probability measures")
      }
      #
      if(is.null(dist)) dist <- 1-diag(m)
    
      obj <- c(t(dist))
      A <- rbind(t(model.matrix(~0+gl(m,n)))[,],
                 t(model.matrix(~0+factor(rep(1:n,m))))[,])
      x <- Variable(m*n)
      objective   <- Minimize(t(obj) %*% x)
      constraints <- list(x >= 0, A%*%x == c(mu,nu))
      problem     <- Problem(objective, constraints)
    
      kanto <- psolve(problem, solver = solver, ...)
      # status
      if(kanto$status != "optimal"){
        if(stop_if_fail){
          stop(sprintf("No optimal solution found: status %s \n", kanto$status))
        }else{
          warning(sprintf("No optimal solution found: status %s \n", kanto$status))
          return(kanto)
        }
      }
      # output
      out <- kanto$value
      if(solution) attr(out, "solution") <-
        matrix(kanto$getValue(x), nrow=m, byrow=TRUE)
      out
    }
    

    And you can get the distance as follows:

    s1 <- sample[, "variable"]
    s2 <- sample2[, "variable"]
    n1 <- length(s1)
    n2 <- length(s2)
    mu1 <- rep(1, n1) / n1
    mu2 <- rep(1, n2) / n2
    # distance matrix:
    M <- outer(s1, s2, FUN = function(x, y) abs(x - y))
    
    kantorovich_CVX(mu1, mu2, dist = M)
    

    Edit: using 'emdist' package

    You just have to switch the two columns:

    library(emdist)
    A <- sample[, 2:1]
    B <- sample2[, 2:1]
    emd(A, B)
    

    But there's a problem: this does not yield the same result as kantorovich... I'm working on it.

    Edit

    I get the same result as kantorovich if I use emdw. I believe there's a bug in emd.