Search code examples
rfrequency-distribution

Frequency table for a random subset in R


Suppose I have a frequency chart of the ten most popular names of babies born in the US in 2006:

myfreq <- c(24835, 22630, 22313, 21398, 20504, 20326, 20054, 19711, 19672, 19400)
names(myfreq) <- c("Jacob", "Michael", "Joshua", "Emily", "Ethan", "Matthew", "Daniel", "Andrew", "Christopher", "Anthony")

> myfreq
      Jacob     Michael      Joshua       Emily       Ethan     Matthew      Daniel 
      24835       22630       22313       21398       20504       20326       20054 
     Andrew Christopher     Anthony 
      19711       19672       19400 

Now consider the set of 210,843 babies with those names, born in the US in 2006. This set has 2^210843 subsets. I want the babyname-frequency chart for a random subset of the babies, with each subset being equally likely. My code is as follows:

subfreq <- sapply(myfreq, function(k) sum(rbinom(k, 1, 0.5)))

Is that doing what I want it to do? And is there some way to improve performance? It's going to be in a loop with millions of iterations, and the rbinom function seems very slow; I wonder if there is a faster function in R for this special case of the binomial distribution where p=1/2. Thanks for any assistance.


Solution

  • Can't be done exactly. You can't construct all the possible subsets, so forget that approach.

    Could be done approximately if you know some math.

    First you need the probability of the sample size being n, which is (in R) naively:

    choose(N, n)/2^N
    

    That will break down for moderate N and n (for example N=1050 and n=525). So you can try logarithms and after some work you get (where lgamma is the log of the gamma function and the gamma function at n+1 is the same as n!) the probability given by:

    exp(lgamma(N+1) - lgamma(n+1) - lgamma(N-n+1) - N*log(2))
    

    To get all the probabilities into one vector we can wrap this into a function:

    pmf <- function(N,n) {
      exp(lgamma(N+1) - lgamma(n+1) - lgamma(N-n+1) - N*log(2))
    }
    
    N <- sum(myfreq)
    probs <- sapply(0:N, function(n) pmf(N,n))
    

    Note that most sample sizes have 0 probability (approximately). Now to select your sample you would first pick a sample size according to the probabilities in probs and then pick a sample of that size from the population of names. We need to make that population first from the frequencies you gave.

    mypop <- rep(mynames, myfreq)
    

    The sample itself:

    sample(mypop, sample(0:N, 1, prob = probs))
    

    To replicate a bunch of times:

    k <- 100
    samps <- replicate(k, sample(mypop, sample(0:N, 1, prob = probs)))
    

    samps is a list of samples of randomly selected sizes.

    Note that the only sample sizes with non-zero probabilities to be selected are:

    range(which(probs > 0))
    #> 96603 114242 
    

    so the properties of your samples aren't going to be as interesting as you might have thought. They will be very close to the population distribution of baby names. Definitely more interesting to have made the babies to begin with.