I am attempting to simulate rate data: count data that generally fit a negative binomial distribution, plus an offset term for survey effort.
I think I can simulate the counts well using rnbinom(), but I then have no way to account for the offset term, which varies randomly from survey to survey. Put another way:
Background: Our surveys measure counts of individuals per unit of survey effort (time), and the resulting rate is a non-negative, non-integer value (>= 0). The survey count data seem well modeled by a negative-binomial distribution, and in a GLM framework I would account for effort by including log(survey time) as an offset term. In the simulated data below, I generate a negative-binomial distribution to represent counts from my actual survey. The offset term is simulated as a random uniform variable between 2 and 10 (the range of search times in minutes). Rate is then calculated as counts/time.
I plot histograms of both counts and rate to demonstrate that the rates take many fractional values between integers. Because survey counts are often correlated with survey effort, it is critical that I ultimately use the rate data for analysis (i.e. figure 'B' below).
library(tidyverse)
theme_set(theme_classic())
d = data.frame(counts = rnbinom(n = 500, mu = 5, size = 1), ## dispersion parameter 'theta' set to 1
               time = runif(500, 2, 10)) %>%
  mutate(rate = counts/time)
## Count data histogram
ggplot(d, aes(x = counts)) +
  geom_histogram(fill = 'peachpuff', color = 'black') +
  ylab('frequency') +
  scale_y_continuous(expand = c(0, 0)) +
  ggtitle('A: Histogram of Counts')

## Rate data histogram
ggplot(d, aes(x = rate)) +
  geom_histogram(binwidth = .1, fill = 'dodgerblue1', color = 'black') +
  scale_x_continuous(breaks = seq(0, 10, 1)) +
  scale_y_continuous(expand = c(0, 0)) +
  ylab('frequency') +
  ggtitle('B: Histogram of Rate')
Below I can readily simulate counts from my original survey data, but I don't know how to properly simulate the data back as a rate. For example, if I fit an intercept-only negative-binomial GLM, I can use the coefficient to simulate new negative-binomial distributions of counts that are very similar to the original data (i.e. they use a similar value for 'mu').
[I realize this seems circular in this example, but this is my approach with real data: first describe the mean value and dispersion 'theta' with a GLM, then simulate back datasets that mimic my original dataset.]
I use this approach below both to simulate back count data and, by fitting a model with the offset term, to simulate back a distribution that has the mean 'rate' from figure 'B'.
### Simulate back count data from the original survey data:
## describe the mean value 'mu' by finding the intercept
## 'theta' could also be calculated (see the note after figure 'C' below)
m1 = MASS::glm.nb(counts ~ 1, data = d)
# summary(m1)
# mean(d$counts)
# exp(m1$coefficients[1])
## simulated negative-binomial distribution using the calculated 'mu'
d.sim = data.frame(new.counts = rnbinom(500,
                                        mu = as.numeric(exp(m1$coefficients[1])), ## coef on log scale, exponentiate to use
                                        size = 1)) ## holding dispersion parameter 'theta' constant at 1
## Plot and compare with plot 'A' above
ggplot(d.sim, aes(x = new.counts)) +
  geom_histogram(fill = 'peachpuff3', color = 'black') +
  ylab('frequency') +
  scale_y_continuous(expand = c(0, 0)) +
  ggtitle('C: Simulated Counts')
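As noted above, 'theta' could also be calculated rather than held at 1: MASS::glm.nb() stores the estimated dispersion in the fitted object, so it can be plugged straight into rnbinom(). A minimal sketch (d.sim.theta is just an illustrative name):
## the estimated dispersion is stored as m1$theta (standard error: m1$SE.theta)
d.sim.theta = data.frame(new.counts = rnbinom(500,
                                              mu = as.numeric(exp(m1$coefficients[1])),
                                              size = m1$theta))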
###########################################
###########################################
### Simulate back 'rate' data by including an offset term for effort in the GLM model
## the exponentiated coefficient should be close to the mean of the raw rate data
m2 = MASS::glm.nb(counts ~ 1 + offset(log(time)), data = d)
# summary(m2)
# mean(d$rate)
# exp(m2$coefficients[1])
d.sim.2 = data.frame(new.counts = rnbinom(500,
                                          mu = as.numeric(exp(m2$coefficients[1])), ## coef on log scale, exponentiate to use
                                          size = 1)) ## holding dispersion parameter 'theta' constant at 1
## compare these simulated 'rate' data (figure 'D') with the non-integer 'true rate' data in figure 'B'
ggplot(d.sim.2, aes(x = new.counts)) +
  geom_histogram(binwidth = .1, fill = 'dodgerblue3', color = 'black') +
  scale_x_continuous(breaks = seq(0, 10, 1)) +
  ylab('frequency') +
  scale_y_continuous(expand = c(0, 0)) +
  ggtitle('D: Simulated Rate')
So at this point I've generated figure 'C' as a simulated dataset representing the counts I observe in real life, and it closely matches the original data in figure 'A'. The 'rate' data in figure 'D', however, are (necessarily) all integer values drawn from rnbinom(), and while the mean of figure 'D' approximates the mean of figure 'B', my sense is that these two distributions are not really equivalent.
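A quick check makes the mismatch concrete:
## every simulated 'rate' in figure 'D' is an integer...
all(d.sim.2$new.counts %% 1 == 0)  # TRUE
## ...while a true rate in figure 'B' is only an integer when the count is 0
mean(d$rate %% 1 == 0)             # roughly the proportion of zero counts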
So my questions again: (1) How can I properly simulate rate data that accounts for a random offset term for survey effort? (2) Failing that, is it reasonable to just use data simulated as in figure 'D'?
For additional context, I'll be using the simulated datasets (many of them) to run other Monte-Carlo type simulation analyses (e.g. power analysis). I'm worried that if I use data generated as in figure 'D', it won't really represent what my actual survey data will be (figure 'B').
The way you generate your sample data (in place of your empirical data) does not align with the data-generating process you describe. The count data from rnbinom(n = 500, mu = 5, size = 1) do not depend on time: mu needs to be a function of the time variable, or else the counts are independent of time.
Also, note that size = 1 does not remove the overdispersion: for the negative binomial, the variance is mu + mu^2/size, so size = 1 actually implies strong overdispersion, and the Poisson (no overdispersion) is only the limiting special case as size grows large. Given your description of the DGP, it does sound like there would be overdispersion in the empirical data, so the dispersion is worth estimating rather than fixing.
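A quick sketch of that variance relationship:
## with mu = 5 and size = 1, Var = mu + mu^2/size = 5 + 25 = 30 (vs 5 for a Poisson)
x <- rnbinom(1e5, mu = 5, size = 1)
c(mean = mean(x), var = var(x))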
To answer your first question, see the code example below. Regarding your second question: no, I don't think that would be a good idea.
library(tidyverse)
library(rstanarm)
options(mc.cores = parallel::detectCores())
n <- 1000
empirical <-
  tibble(
    time = runif(n, 2, 10),
    count = rnbinom(n = n, mu = time, size = 1) # generate count data that actually depends on time
  ) |>
  mutate(rate = count/time)
m_stan <- stan_glm.nb(count ~ time, data = empirical)
simulated <-
  tibble(
    time = runif(n, 2, 10)
  ) %>% # magrittr pipe so that '.' passes the new survey times to posterior_predict()
  mutate(
    count = posterior_predict(m_stan, ., draws = 1) |>
      as.vector(), # a single posterior-predictive draw of counts
    rate = count/time
  )
d <- lst(simulated, empirical) |>
  bind_rows(.id = "data")

d |>
  select(data, count, rate) |>
  pivot_longer(c(count, rate)) |>
  ggplot() +
  geom_histogram(aes(value), binwidth = .2) +
  facet_grid(data ~ name, scales = "free")
Created on 2022-02-03 by the reprex package (v2.0.1)
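Since you mention running many Monte-Carlo type simulations: posterior_predict() can return many simulated datasets at once, one per row of the returned matrix, so there is no need to loop over single draws. A sketch reusing m_stan from above (new_surveys, count_draws, and rate_draws are just illustrative names):
## 500 simulated count datasets for a new set of survey times;
## posterior_predict() returns a draws x n matrix (one dataset per row)
new_surveys <- tibble(time = runif(n, 2, 10))
count_draws <- posterior_predict(m_stan, new_surveys, draws = 500)
## convert each simulated count dataset to rates by dividing
## each column (survey) by its survey time
rate_draws <- sweep(count_draws, 2, new_surveys$time, "/")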