How to calculate percent overlap in distributions in R?

I have a dummy data frame (below) and would like to calculate the pairwise percent overlap between the site distributions. That is, what percentage of site1 and site2 overlap, and likewise for site2 vs. site3 and site1 vs. site3?

df <- structure(list(site = c("site1", "site1", "site1", "site1", "site1", 
"site1", "site1", "site1", "site1", "site1", "site2", "site2", 
"site2", "site2", "site2", "site2", "site2", "site2", "site2", 
"site2", "site3", "site3", "site3", "site3", "site3", "site3", 
"site3", "site3", "site3", "site3"), total = c(0.4191, 0.2844, 
0.2611, 0.2743, 0.2938, 0.3287, 0.2992, 0.4062, 0.2946, 0.2671, 
0.3832, 0.3875, 0.3118, 0.4506, 0.4215, 0.4266, 0.3518, 0.4446, 
0.4255, 0.3208, 0.2377, 0.2818, 0.2526, 0.2425, 0.2973, 0.4539, 
0.357, 0.2865, 0.3624, 0.3026)), class = c("grouped_df", "tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -30L), groups = structure(list(
    site = c("site1", "site2", "site3"), .rows = structure(list(
        1:10, 11:20, 21:30), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), row.names = c(NA, -3L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE))

library(ggplot2)

ggplot(df, aes(x = total, group = site, fill = site)) +
  geom_density(adjust = 1.5, alpha = 0.3)

Solution

Your density plot is perhaps a little misleading: a kernel density estimate extends beyond the actual range of each site's data on the x axis, so it tends to suggest much more overlap than actually exists in your data. A better visualization might be to draw each site's observed range as a horizontal segment:

library(dplyr)
library(ggplot2)

df %>%
  ungroup() %>%
  mutate(site = factor(site)) %>%   # convert before grouping so all levels are kept
  group_by(site) %>%
  summarize(xmin = min(total), xmax = max(total),
            y = as.numeric(first(site))) %>%   # one row per site; y is the level number
  ggplot() +
  # in ggplot2 >= 3.4.0, use linewidth = 2 instead of size = 2
  geom_segment(aes(x = xmin, xend = xmax, y = y, yend = y, color = site),
               size = 2) +
  scale_y_continuous(breaks = 1:3, expand = c(1, 1)) +
  theme_bw()
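
The point about density curves reaching past the data can be checked directly. The following is a minimal sketch using base R's density() on the site1 values from the question; geom_density() may choose its evaluation grid differently, so this illustrates the general behaviour of a kernel estimate rather than reproducing the plot exactly.

```r
# site1 values copied from the question's data
site1 <- c(0.4191, 0.2844, 0.2611, 0.2743, 0.2938,
           0.3287, 0.2992, 0.4062, 0.2946, 0.2671)

# Kernel density estimate with the same bandwidth adjustment as the plot
d <- density(site1, adjust = 1.5)

range(site1)  # observed range of the data
range(d$x)    # grid the density is evaluated on; wider than the data

# By default density() extends the grid cut = 3 bandwidths past each extreme
all.equal(min(d$x), min(site1) - 3 * d$bw)
```

Any overlap measured on those curves would include the tails that lie outside the observed values, which is why the range-based plot is a more conservative picture.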

One approach to creating pairwise comparisons is to use expand.grid to get every ordered pair of sites (this includes each site paired with itself, which we remove later):

comp_df <- expand.grid(A = sort(unique(df$site)), 
                       B = sort(unique(df$site)))
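
To make the shape of that grid concrete, here is a small self-contained sketch with the three site names typed in by hand (the `sites` and `pairs` names are only for illustration):

```r
sites <- c("site1", "site2", "site3")

# Every ordered combination of A and B, including A == B
pairs <- expand.grid(A = sites, B = sites)

nrow(pairs)               # 3 * 3 = 9 rows
sum(pairs$A == pairs$B)   # 3 self-pairs, one per site
```

Both orderings of each pair are kept on purpose, because the overlap of A by B is generally not the same as the overlap of B by A.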

Then we need a function that takes the names of two sites and calculates the proportion of the first site's range that is overlapped by the second site's range. I'm doing this here in a rather pedestrian way using simple arithmetic:

comp_func <- function(a, b) {
  # Observed range (min, max) of each site's values
  range_a <- range(df$total[df$site == a])
  range_b <- range(df$total[df$site == b])
  # Clip b's range to a's range to get the shared interval
  lower <- max(range_a[1], range_b[1])
  upper <- min(range_a[2], range_b[2])
  # Proportion of a's range covered by b (0 if the ranges are disjoint)
  max(0, upper - lower) / (range_a[2] - range_a[1])
}
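
The same range arithmetic can be tried on plain numeric vectors to see how it behaves at the edges. This is only a sketch: `range_overlap` is a hypothetical helper, not part of the answer's code, and it clamps the shared width at zero so that disjoint ranges report 0.

```r
# Hypothetical helper: proportion of x's range that is covered by y's range
range_overlap <- function(x, y) {
  lower <- max(min(x), min(y))   # start of the shared interval
  upper <- min(max(x), max(y))   # end of the shared interval
  # A negative width means the ranges do not touch, so report zero overlap
  max(0, upper - lower) / (max(x) - min(x))
}

half     <- range_overlap(c(0, 10), c(5, 20))    # half of x's range is covered
inside   <- range_overlap(c(0, 10), c(-5, 20))   # x lies entirely inside y
disjoint <- range_overlap(c(0, 10), c(15, 20))   # ranges do not touch

c(half = half, inside = inside, disjoint = disjoint)
```

Note that the measure depends only on the minimum and maximum of each group, so a single outlier can change it considerably.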

Now we can Map this function over the rows of our comparison data frame so that we get an overlap estimate for each pair of sites:

comp_df$overlap <- unlist(Map(comp_func, a = comp_df$A, b = comp_df$B))
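
For readers less familiar with Map, the toy sketch below shows how it walks two vectors in parallel, and that mapply gives the same values without the unlist step. The function `f` and the vectors are made up for illustration and stand in for comp_func and the two site columns.

```r
# Toy function of two arguments, standing in for comp_func
f <- function(a, b) a - b

a <- c(10, 20, 30)
b <- c(1, 2, 3)

# Map() calls f(a[1], b[1]), f(a[2], b[2]), ... and returns a list
via_map <- unlist(Map(f, a = a, b = b))

# mapply() does the same and simplifies the result to a vector itself
via_mapply <- mapply(f, a = a, b = b)

identical(via_map, via_mapply)
```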

Finally, we want to remove the entries where a site is compared with itself, since this overlap will always be 100%:

comp_df <- comp_df[comp_df$A != comp_df$B,]

The final result can be sense-checked against our plot (the overlap column is the proportion of the range of the site in column A that is overlapped by the site in column B):

comp_df
#>       A     B   overlap
#> 2 site2 site1 0.7730548
#> 3 site3 site1 0.7308048
#> 4 site1 site2 0.6791139
#> 6 site3 site2 0.6419981
#> 7 site1 site3 1.0000000
#> 8 site2 site3 1.0000000

So, for example, we can see that site 1 and site 2 are both 100% overlapped by site 3, as we can confirm in our plot, whereas site 1 is about 68% overlapped by site 2.
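
The 68% figure can also be verified by hand from the extremes of the two sites. The sketch below hard-codes the minimum and maximum of site1 and site2 as read from the question's data; the variable names are only for illustration.

```r
# Observed extremes read off the question's data
site1_min <- 0.2611; site1_max <- 0.4191
site2_min <- 0.3118; site2_max <- 0.4506

# The shared interval runs from the larger minimum to the smaller maximum
shared <- min(site1_max, site2_max) - max(site1_min, site2_min)

overlap_1_by_2 <- shared / (site1_max - site1_min)  # share of site 1's range
overlap_2_by_1 <- shared / (site2_max - site2_min)  # share of site 2's range

round(c(overlap_1_by_2, overlap_2_by_1), 4)
```

The two values differ because the same shared interval is divided by ranges of different widths, which is why the table lists each pair in both directions.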

Created on 2022-04-25 by the reprex package (v2.0.1)