Search code examples
rvalidationbinning

R: Comparison between two hexbins with applying KL divergence


Suppose I have two data sets with different sizes, each data set contains x and y to locate each observation.

set.seed(1)
x1 <- runif(1000,-195.5,195.5)
y1 <- runif(1000,-49,49)
data1 <- data.frame(x1,y1)
x2 <- runif(2000,-195.5,195.5)
y2 <- runif(2000,-49,49)
data2 <- data.frame(x2,y2)

Here I generated two data sets with random locations within an specific area.

Then I generated two hexbins of each dataset. And I know for achieving tracing back the bins, I need to set IDs = TRUE

hbin_1 <- hexbin(x=data1$x1,y=data1$y1,xbins=30,shape=98/391,IDs=TRUE)
hbin_2 <- hexbin(x=data2$x2,y=data2$y2,xbins=30,shape=98/391,IDs=TRUE)

In next step, I wanna apply KL divergence to achieve comparison of these two datasets. Then the problem is how can I get the matching bin in second data set to the first data set? (I wanna compare the bins with same location in two different datasets)

UPDATE We can get the table contains the cell name(bin number) with corresponding count of observations in this bin by

tI1 <- table(hbin_1@cID)
tI2 <- table(hbin_2@cID)

The problem is the bin numbers are different between dataset1 and dataset2. Even we set same xbins and shape in the function hexbin, we still get different bins of two datasets. Then how can I compare the two datasets (or get bins with same location)?


Solution

  • The function hexbin doesn't not return empty bins. Hence, even we set the xbins, xbnds and ybnds same, the returned hexbin results can be different for two datasets.

    We can use kde2d from the package MASS to achieve two-dimensional kernel density estimation.

    b1 <- kde2d(data1$x1,data1$y1,lims = c(xbnds,ybnds))
    b2 <- kde2d(data2$x2,data2$y2,lims = c(xbnds,ybnds))
    

    Then, we can get two vectors of kernel density estimation of two datasets, and then normalising the results by dividing by the sum of each vector of the estimated density. Finally, we can apply KL divergence to quantify the similarity of the distributions.

    z1 <- as.vector(b1$z)
    z2 <- as.vector(b2$z)
    z1 <- mapply("/",z1,0.01509942)
    z2 <- mapply("/",z2,0.01513236)
    kullback.leibler(z1, z2)