Search code examples
rggplot2distributiondata-fitting

How to fit negative binomial function to data that is facet_wrapped by a factor in R ggplot?


I'm trying to fit a negative binomial distribution to counts data but scaled back to counts like in this example In my data, I have to separate out the binomial function plotting for two species. However, there is not easy way to specify this within the function and getting the line legends with parameter values in the key for both species.

set.seed(111)
count <- rbinom(500,100,0.1) 
species <- rep(c("A","B"),time = 250)
df <- data.frame(count,species)

#Specifying negative binomial function
negbinom.params <- fitdistr(df$count,"negative binomial", method = "SANN")$estimate
dist.params <- map(list(`Negative Binomial` = negbinom.params),~ map2(names(.),.,~ paste0(.x," = ",round(.y,2))) %>% unlist %>% paste0(.,collapse = ", ")) %>% map2_chr(names(.),., ~ paste(.x,.y,sep=":\n"))

#Plotting
mybinwidth = 2
ggplot(df, aes(x = count, colour = species, fill = species)) + 
  facet_grid(.~species) +  
  geom_histogram(aes(y=..count..),alpha = 0.5, binwidth = mybinwidth) + 
  stat_function(aes(color = "orange"), 
                fun = function(x,size, mu) {
                    mybinwidth * nrow(df) * dnbinom(x,size = size, mu = mu)
                },
                args=fitdistr(df$count, "negative binomial", method="SANN")$estimate, 
                xlim=c(0,50),n=20)

enter image description here


Solution

  • You are right, this is a bit of a pain to get right. I've adapted your example a little bit to show two different distribution more clearly. Here is my attempt to make your approach work:

    library(ggplot2)
    library(MASS)
    #> Warning: package 'MASS' was built under R version 3.6.2
    
    set.seed(111)
    
    df <- data.frame(
      count = rnbinom(500, rep(c(5, 10), each  = 250), 0.5),
      species = rep(c("A", 'B'), each = 250)
    )
    
    # Not the prettiest formatting, but it'll show the point
    ests <- sapply(split(df$count, df$species), function(x) {
      est <- fitdistr(x, "negative binomial", method = "SANN")$estimate
      formatted <- paste0(names(est)[1], " = ", format(est, digits = 2)[1], ",",
                          names(est)[2], " = ", format(est, digits = 2)[2])
      formatted
    })
    
    mybinwidth <- 1
    
    spec_A = df[df$species == "A",]
    spec_B = df[df$species == "B",]
    
    ggplot(df, aes(count)) +
      geom_histogram(binwidth = mybinwidth,
                     aes(fill = species), alpha = 0.5,
                     position = "identity") +
      stat_function(aes(color = "A"), 
                    data = data.frame(species = "A"),
                    fun = function(x, size, mu) {
                      mybinwidth * nrow(spec_A) * dnbinom(x,size = size, mu = mu)
                    },
                    args = fitdistr(spec_A$count, "negative binomial", method="SANN")$estimate, 
                    xlim = c(0, max(df$count)), n= max(df$count) + 1, inherit.aes = FALSE) +
      stat_function(aes(color = "B"), 
                    data = data.frame(species = "B"),
                    fun = function(x, size, mu) {
                      mybinwidth * nrow(spec_B) * dnbinom(x,size = size, mu = mu)
                    },
                    args = fitdistr(spec_B$count, "negative binomial", method="SANN")$estimate, 
                    xlim = c(0, max(df$count)), n= max(df$count) + 1, inherit.aes = FALSE) +
      scale_colour_discrete(labels = unname(ests), name = "fit") +
      facet_wrap(~ species)
    #> Warning: `mapping` is not used by stat_function()
    #> Warning: `data` is not used by stat_function()
    #> Warning: `mapping` is not used by stat_function()
    #> Warning: `data` is not used by stat_function()
    

    Created on 2020-04-15 by the reprex package (v0.3.0)

    There are also packages that do the majority of this work for you. Disclaimer: I wrote ggh4x, so I'm not unbiased. You can also replace the ggplot code with the following (assuming similar preprocessing)

    library(ggh4x)
    ggplot(df, aes(count)) +
      geom_histogram(binwidth = mybinwidth,
                     aes(fill = species), alpha = 0.5,
                     position = "identity") +
      stat_theodensity(aes(colour = species,
                           y = after_stat(count * mybinwidth)),
                       distri = "nbinom") +
      scale_colour_discrete(labels = unname(ests), name = "fit") +
      facet_wrap(~ species)
    

    Hope that helped!