Search code examples
rsimulationprobability

Having trouble solving simulation


I got a question related to probability theory and I tried to solve it by simulating it in R. However, I ran into a problem as the while loop does not seem to break.

The question is asking: How many people are needed such that there is at least a 70% chance that one of them is born on the last day of December?

Here is my code:

prob <- 0 
people <- 1 

while (prob <= 0.7) {
  people <- people + 1 #start the iteration with 2 people in the room and increase 1 for every iteration
  birthday <- sample(365, size = people, replace = TRUE) 
  prob <- length(which(birthday == 365)) / people
}
return(prob)

My guess is that it could never hit 70%, therefore the while loop never breaks, am I right? If so, did I interpret the question wrongly?

I did not want to post this on stats.stackexchange.com because I thought this is more related to code rather than math itself, but I will move it if necessary, thanks.


Solution

  • This is a case where an analytical solution based on probability is easier and more accurate than trying to simulate. I agree with Harshvardhan that your formulation is solving the wrong problem.

    The probability of having at least one person in a pool of n have their birthday on a particular target date is 1-P{all n miss the target date}. This probability is at least 0.7 when P{all n miss the target date} < 0.3. The probability of each individual missing the target is assumed to be P{miss} = 1-1/365 (365 days per year, all birthdates equally likely). If the individual birthdays are independent, then P{all n miss the target date} = P{miss}^n.

    I am not an R programmer, but the following Ruby should translate pretty easily:

    # Use rationals to avoid cumulative float errors.
    # Makes it slower but accurate.
    P_MISS_TARGET = 1 - 1/365r   
    p_all_miss = P_MISS_TARGET
    threshold = 3r / 10   # seeking P{all miss target} < 0.3
    n = 1
    while p_all_miss > threshold
      p_all_miss *= P_MISS_TARGET
      n += 1
    end
    puts "With #{n} people, the probability all miss is #{p_all_miss.to_f}"
    

    which produces:

    With 439 people, the probability all miss is 0.29987476838793214


    Addendum

    I got curious, since my answer differs from the accepted one, so I wrote a small simulation. Again, I think it's straightforward enough to understand even though it's not in R:

    require 'quickstats'  # Stats "gem" available from rubygems.org
    
    def trial
      n = 1
      # Keep adding people to the count until one of them hits the target
      n += 1 while rand(1..365) != 365
      return n
    end
    
    def quantile(percentile = 0.7, number_of_trials = 1_000)
      # Create an array containing results from specified number of trials.
      # Defaults to 1000 trials
      counts = Array.new(number_of_trials) { trial }
      # Sort the array and determine the empirical target percentile.
      # Defaults to 70th percentile
      return counts.sort[(percentile * number_of_trials).to_i]
    end
    
    # Tally the statistics of 100 quantiles and report results,
    # including margin of error, formatted to 3 decimal places.
    stats = QuickStats.new
    100.times { stats.new_obs(quantile) }
    puts "#{"%.3f" % stats.avg}+/-#{"%.3f" % (1.96*stats.std_err)}"
    

    Five runs produce outputs such as:

    440.120+/-3.336
    440.650+/-3.495
    435.820+/-3.558
    439.500+/-3.738
    442.290+/-3.909
    

    which is strongly consistent with the analytical result derived earlier and seems to differ significantly from other responder's answers.

    Note that on my machine the simulation takes roughly 40 times longer than the analytical calculation, is more complex, and introduces uncertainty. To increase the precision you would need larger sample sizes, and thus longer run times. Given these considerations, I would reiterate my advice to go for the direct solution in this case.