
Kullback-Leibler over discrete samples


I have two samples drawn from two different distributions, and I'd like a measure of the distance between those distributions. My idea was to compute the Kullback-Leibler divergence between them, but I'm running into difficulty because my samples don't cover the same space.

To illustrate my problem, I will use N(0, 1) and N(1, 1) as an example, but in reality, those distributions will not be normal.

In theory the KL divergence between N(0, 1) and N(1, 1) is 0.5, as confirmed below:

# sanity check using the normal-distribution pdfs evaluated on a common grid
p1 <- dnorm(-10:10, 1, 1)
q1 <- dnorm(-10:10, 0, 1)
z1 <- rbind(p1 / sum(p1), q1 / sum(q1))  # normalise each row to sum to 1
philentropy::distance(z1, method = "kullback-leibler")

kullback-leibler 0.5
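
For reference, the closed-form KL divergence between two univariate Gaussians is KL(N(μ1, σ1²) ‖ N(μ2, σ2²)) = log(σ2/σ1) + (σ1² + (μ1 − μ2)²)/(2σ2²) − 1/2, which reduces to 0.5 here (equal unit variances, means one apart). A minimal sketch confirming it; the helper name kl_gauss is purely illustrative:

# closed-form KL divergence between two univariate Gaussians (natural log);
# kl_gauss is an illustrative helper, not from any package
kl_gauss <- function(mu1, sd1, mu2, sd2) {
  log(sd2 / sd1) + (sd1^2 + (mu1 - mu2)^2) / (2 * sd2^2) - 1 / 2
}
kl_gauss(1, 1, 0, 1)
# 0.5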

Now, my two samples don't cover the same space, as illustrated below:

# discrete observations (i.e. integers only)
library(data.table)

nbobsdiv2 <- 100
space1 <- data.table(x = sort(sample(x = -nbobsdiv2:nbobsdiv2, size = nbobsdiv2 * 0.75)))
space2 <- data.table(x = sort(sample(x = -nbobsdiv2:nbobsdiv2, size = nbobsdiv2 * 0.75)))

setkey(space1, x)
setkey(space2, x)

space1[, p := dnorm(x, 0, 1)]  # densities of N(0, 1) on the first space
space2[, p := dnorm(x, 1, 1)]  # densities of N(1, 1) on the second space
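
To make the mismatch concrete, a quick diagnostic (output omitted since no seed is set) counts how many integer x values each space contains that the other lacks:

# values present in one sampled space but missing from the other
length(setdiff(space1$x, space2$x))
length(setdiff(space2$x, space1$x))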

So I've tried taking either the union or the intersection of the two spaces, hoping to recover the expected value of 0.5.
First, comparing the pdf values over the union of the two sample spaces:

# union of both spaces restricted to their common range, filling missing densities with 0
space21.min <- max(min(space1$x), min(space2$x))
space21.max <- min(max(space1$x), max(space2$x))
space12.union <- data.table(x = sort(union(space1$x, space2$x)))[x >= space21.min & x <= space21.max, ]
setkey(space12.union, x)
pdf12.union <- space2[space1[space12.union]]  # keyed joins onto the union grid
setnames(pdf12.union, c("x", "p2", "p1"))
pdf12.union[is.na(p1), p1 := 0][is.na(p2), p2 := 0]  # zero density where a space lacks the point

kld12.union <- philentropy::distance(t(pdf12.union[, .(p1 / sum(p1), p2 / sum(p2))]),
                                     method = "kullback-leibler")
kld12.union

kullback-leibler 11.50815

Then, comparing the pdf values over the intersection of the two sample spaces:

# intersection of both spaces
space12.intersect <- data.table(x = sort(intersect(space1$x, space2$x)))
setkey(space12.intersect, x)
pdf12.intersect <- space2[space1[space12.intersect]]  # keyed joins, as above
setnames(pdf12.intersect, c("x", "p2", "p1"))

kld12.intersect <- philentropy::distance(t(pdf12.intersect[, .(p1 / sum(p1), p2 / sum(p2))]),
                                         method = "kullback-leibler")
kld12.intersect

kullback-leibler 6.914469e-13

So neither approach works. I wonder which direction to take at this point. My next idea would be to fit a distribution to my data and then generate samples from it, but in reality my distributions d1 and d2 are probably non-Gaussian and trimodal, so that part would be problematic. Any tips appreciated.


Solution

  • The Kullback–Leibler (KL) divergence is infinite whenever P puts positive probability on a point where Q puts zero probability, and in particular when P and Q have disjoint support: see "KL divergence between which distributions could be infinity".

    Take this example:

    x <- seq(6)  # common grid: 1..6

    P <- c(1 / 4, 1 / 4, 1 / 4, 1 / 4, 0, 0)  # mass on 1..4
    Q <- c(0, 0, 1 / 4, 1 / 4, 1 / 4, 1 / 4)  # mass on 3..6
    

    In theory, the KL divergence DKL(P ‖ Q) between the probability distributions P and Q is infinity, because P puts mass on the points 1 and 2 where Q is zero. philentropy::distance nonetheless returns a finite value because it replaces zero probabilities with a small epsilon to avoid division by zero. See the documentation.

    philentropy::distance(
      rbind(P, Q),
      method = "kullback-leibler"
    )
    #> Metric: 'kullback-leibler' using unit: 'log'; comparing: 2 vectors.
    #> kullback-leibler 
    #>         5.063316
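
    To see where that finite value comes from, here is a hand-rolled version of the same computation; the epsilon value 1e-5 is an assumption taken from the philentropy documentation, and terms with P = 0 are treated as contributing 0:

    # manual KL(P || Q): replace zero denominators with epsilon
    # (epsilon = 1e-5 is assumed, per the philentropy docs)
    eps <- 1e-5
    Qe <- ifelse(Q == 0, eps, Q)
    sum(ifelse(P > 0, P * log(P / Qe), 0))
    #> [1] 5.063316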
    

    Consider other metrics for the distance between the two distributions, for example the earth mover's distance (EMD), which remains finite even when the supports do not overlap.

    emdist::emd(cbind(P, x), cbind(Q, x))  # first column: weights, second: positions
    #> [1] 2
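
    The value 2 can be verified by hand: in one dimension the EMD equals the sum of absolute differences of the two CDFs times the grid spacing, and here the whole unit of mass shifts two positions to the right:

    # 1-D EMD via the CDF identity: sum(|cumsum(P) - cumsum(Q)|) * dx
    sum(abs(cumsum(P) - cumsum(Q)))  # grid spacing dx = 1
    #> [1] 2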
    

    Created on 2022-07-28 by the reprex package (v2.0.1)