Suppose I have a frequency chart of the ten most popular names of babies born in the US in 2006:
myfreq <- c(24835, 22630, 22313, 21398, 20504, 20326, 20054, 19711, 19672, 19400)
names(myfreq) <- c("Jacob", "Michael", "Joshua", "Emily", "Ethan", "Matthew", "Daniel", "Andrew", "Christopher", "Anthony")
> myfreq
Jacob Michael Joshua Emily Ethan Matthew Daniel
24835 22630 22313 21398 20504 20326 20054
Andrew Christopher Anthony
19711 19672 19400
Now consider the set of 210,843 babies with those names, born in the US in 2006. This set has 2^210843 subsets. I want the babyname-frequency chart for a random subset of the babies, with each subset being equally likely. My code is as follows:
subfreq <- sapply(myfreq, function(k) sum(rbinom(k, 1, 0.5)))
Is that doing what I want it to do? And is there some way to improve performance? It's going to be in a loop with millions of iterations, and the rbinom function seems very slow; I wonder if there is a faster function in R for this special case of the binomial distribution where p=1/2. Thanks for any assistance.
Can't be done exactly. You can't construct all the possible subsets, so forget that approach.
Could be done approximately if you know some math.
First you need the probability of the sample size being n
, which is (in R
) naively:
choose(N, n)/2^N
That will break down for moderate N
and n
(for example N=1050
and n=525
). So you can try logarithms and after some work you get (where lgamma
is the log of the gamma function and the gamma function at n+1 is the same as n!) the probability given by:
exp(lgamma(N+1) - lgamma(n+1) - lgamma(N-n+1) - N*log(2))
To get all the probabilities into one vector we can wrap this into a function:
pmf <- function(N,n) {
exp(lgamma(N+1) - lgamma(n+1) - lgamma(N-n+1) - N*log(2))
}
N <- sum(myfreq)
probs <- sapply(0:N, function(n) pmf(N,n))
Note that most sample sizes have 0 probability (approximately). Now to select your sample you would first pick a sample size according to the probabilities in probs
and then pick a sample of that size from the population of names. We need to make that population first from the frequencies you gave.
mypop <- rep(mynames, myfreq)
The sample itself:
sample(mypop, sample(0:N, 1, prob = probs))
To replicate a bunch of times:
k <- 100
samps <- replicate(k, sample(mypop, sample(0:N, 1, prob = probs)))
samps
is a list of samples of randomly selected sizes.
Note that the only sample sizes with non-zero probabilities to be selected are:
range(which(probs > 0))
#> 96603 114242
so the properties of your samples aren't going to be as interesting as you might have thought. They will be very close to the population distribution of baby names. Definitely more interesting to have made the babies to begin with.