Tags: r, distribution, sampling, chi-squared

Estimate probability distribution in R


I am planning an experiment to determine the frequency of a binary variable (valued 1 or 0).

Each day, there are 10,000 new events taking place.

Each day, I get to draw 100 at random out of the new 10,000 and see their outcomes (either 1 or 0).

How do I estimate the frequency of 1 and 0 in the population with this data?

Is there a package in R that can fit a discrete probability distribution to this data?


Solution

  • Suppose you have a population of N=10,000 events on a given day, of which 6,500 are 1s and 3,500 are 0s.

    pop <- rep(c(0,1), times=c(3500, 6500))
    table(pop)
    #pop
    #   0    1 
    #3500 6500
    

    Now suppose you can sample 100 of these (0,1) events without replacement.

    set.seed(123)
    N <- 10000
    n <- 100
    sam <- data.frame(id=1:n, event=sample(pop, size=n), prob=n/N)
    
    table(sam$event)
    # 0  1 
    #30 70
    

    So we got 70 out of 100. The maximum likelihood estimate of the total number of 1-events in the population is simply 70/100 x 10,000 = 7,000. The standard error of this estimate is given by

    sqrt((N-n)/N * N^2 * var(sam$event)/n)
    #[1] 458.2576
    

    The 95% confidence interval is [6102, 7898], which covers the true population total of 6,500. But on about 1 day in 20, such an interval will miss the truth.
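    For completeness, a minimal sketch of that interval calculation (assuming the 70/30 sample shown above; est and se1 are illustrative names, not from the original code):

    est <- 70/100 * N                              # estimated population total
    se1 <- sqrt((N-n)/N * N^2 * var(sam$event)/n)  # standard error from above
    est + c(-1, 1) * 1.96 * se1                    # 95% confidence interval
    #[1] 6101.815 7898.185
    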

    R packages? Not really necessary for this experiment. For more complex sampling designs, I can only think of the survey package, but there may be others.
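    If you do want a package, the survey package can reproduce this estimate. A minimal sketch (assuming the sam data frame from above; the fpc column, an illustrative name, supplies the population size for the finite population correction):

    library(survey)
    sam$fpc <- N                                  # population size for the FPC
    des <- svydesign(ids = ~1, fpc = ~fpc, data = sam)
    svytotal(~event, des)                         # estimated total and its SE
    

    This should match the hand calculation above: a total near 7,000 with an SE near 458.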


    Now if you did this repeatedly, say for 10 days, you'd get an estimate for each day. A frequentist estimate of the population total is then the sample total x N / n, with an SE calculated in the same way as above. For example, suppose you found 3, 4, 5, 11, 6, 8, 14, 8, 17, and 19 "positive" events from samples of 100 on ten consecutive days:

    events1 <- c(3, 4, 5, 11, 6, 8, 14, 8, 17, 19)
    

    The corresponding counts of "negative" or non-events are then:

    events0 <- 100 - events1
    

    A vector of (0,1) events can be constructed as follows using rep.

    events <- rep(rep(c(0,1), each=10), times=c(events0, events1))
    
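    A quick sanity check that the vector has the intended composition (905 zeros and 95 ones):

    table(events)
    #events
    #  0   1 
    #905  95
    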

    Let's define n and N as the number of events in your ten-day sample and in the ten-day population, respectively.

    n <- 100 * 10
    N <- 10000 * 10
    

    The number of "positive" events in your ten-day sample is:

    sum(events1)
    #[1] 95
    

    And the MLE in the ten-day population is:

    (T <- sum(events1) * N / n)
    #[1] 9500
    

    The standard error of this ten-day estimate is:

    SE <- sqrt((N-n)/N * N^2 * var(events)/n); SE
    #[1] 923.0409
    
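    Equivalently, since the variance of a 0/1 vector equals p(1-p) * n/(n-1) with p the sample proportion, the same SE can be computed without building the events vector at all (a sketch; p10 is an illustrative name):

    p10 <- sum(events1) / n                          # sample proportion, 0.095
    sqrt((N-n)/N * N^2 * p10 * (1 - p10) / (n - 1))  # same SE as above
    #[1] 923.0409
    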

    With 95% CI:

    T + c(-1,1) * 1.96*SE
    #[1]  7690.84 11309.16
    

    A Bayesian may say that each day's estimate should be updated from the previous day's, rather than pooled at the end, but with a conjugate prior sequential updating gives exactly the same answer as pooling, as the sketch below shows.
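    To make that concrete, here is a minimal sketch of day-by-day conjugate updating, starting from a Beta(1,1) prior (the conjugacy itself is explained in the next paragraph):

    a <- 1; b <- 1                        # Beta(1,1) prior
    for (k in seq_along(events1)) {
      a <- a + events1[k]                 # add the day's positives
      b <- b + (100 - events1[k])         # add the day's negatives
    }
    c(a, b)                               # Beta(96, 906): identical to pooling all ten days
    #[1]  96 906
    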


    A Bayesian would apply Bayes' rule, taking a Uniform(0,1) distribution as a reasonable prior for the proportion of "positive" events over the ten-day period. Unif(0,1) is the same as Beta(1,1). An experienced statistician (frequentist or Bayesian) would recognise that the beta distribution is conjugate to the binomial distribution. Thus, the Bayesian's posterior for the proportion of "positive" events over the ten-day period is Beta(1+X, 1+N-X), where X is the total number of "positive" events (95) and N is now the total sample size (1,000). Note that the mean of Beta(alpha, beta) is alpha/(alpha+beta).

    In R:

    n <- rep(100, 10)
    events1 <- c(3, 4, 5, 11, 6, 8, 14, 8, 17, 19)
    
    X <- sum(events1)
    N <- sum(n)
    
    pmean <- (1+X)/(2+N); pmean
    #[1] 0.09580838
    
    CI <- qbeta(c(.025,.975), 1+X, 1+N-X); CI # 95% credible interval
    #[1] 0.07837295 0.11477134
    

    Thus, over the ten-day period, the estimated proportion of positive events is 9.58%, and there is a 95% probability that the true proportion lies between 7.84% and 11.48%. Extrapolating to the whole population of 100,000 events, about 9,581 would be positive. This, as I said, is very similar to the frequentist approach.
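    As a quick simulation-based check of these numbers (a sketch; draws is an illustrative name):

    set.seed(123)
    draws <- rbeta(1e5, 1 + X, 1 + N - X)     # draws from the Beta posterior
    quantile(draws, c(.025, .975))            # should closely match qbeta above
    round(mean(draws) * 100000)               # implied positives among 100,000
    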

    Meta-analysis

    Now, these two methods effectively combine all ten days into one big sample and estimate the proportion of positive events, or the total number of positive events, in the whole population. It may be more natural to combine the results of each day using weights, as is done in a meta-analysis.

    If p[k] is the estimated proportion on day k and se[k] is its standard error, then the combined estimate is p_hat = sum(p[k] * w[k]) / sum(w[k]), where w[k] = 1 / se[k]^2, and its standard error is 1 / sqrt(sum(w[k])).

    In R:

    N <- rep(10000, 10) # daily population size, for each of the 10 days
    n <- rep(100, 10)   # daily sample size
    
    events1 <- c(3, 4, 5, 11, 6, 8, 14, 8, 17, 19)
    events0 <- n - events1
    
    p <- SE <- w <- numeric(length(events1))   # per-day proportion, SE, weight
    
    for(k in seq_along(events1)){
      events <- c(rep(0, events0[k]), rep(1, events1[k]))   # day k's 0/1 outcomes
      p[k] <- events1[k] / n[k]                             # day k's sample proportion
      SE[k] <- sqrt((N[k]-n[k]) / N[k] * var(events)/n[k])  # SE with FPC
      w[k] <- 1 / SE[k]^2                                   # inverse-variance weight
    }
    
    p_hat <- sum(p*w)/sum(w); p_hat
    #[1] 0.06997464
    
    SE_p <- 1 / sqrt(sum(w)); SE_p
    #[1] 0.007943816
    
    (p_hat + c(-1,1) * 1.96 * SE_p)
    #[1] 0.05440476 0.08554452
    

    So about 7% of all events are estimated to be positive, with a 95% confidence interval of (5.44%, 8.55%), which is not too different from the two crude methods used above. The estimate is smaller (and perhaps more accurate) because days with fewer positives have smaller variance, and therefore larger inverse-variance weights, in this skewed ten-day sample.
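
    For reference, here are the three ten-day estimates side by side, expressed as totals out of 100,000 (values collected from the outputs above):

    rbind(
      pooled_frequentist = c(est = 9500, lo = 7691, hi = 11309),
      pooled_bayesian    = c(est = 9581, lo = 7837, hi = 11477),
      meta_analysis      = c(est = 6997, lo = 5440, hi = 8554)
    )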