rfunctionstatisticsdata-sciencestatistics-bootstrap

R Function For Coupon Collector's Function with Finite Number of Trials


I'm trying to write up a function that runs the Coupon Collector's Problem in R. I'm trying to make a function that will spit out the number of unique coupons collected from sample size n with a finite number of trials t. (For example, having 500 unique coupons in a sample space, and picking a coupon from it 1000 times with replacement.)

I was able to write a similar function that returns how many times you have to sample it to collect ALL tickets. However, that function only requires specifying n, and does not account for the finite number of trials.

So far, this is how I have attempted to go about writing this function, with the function calling for n and t.

I assumed that if the length of the trials was less than t for the while loop, I could keep counting the unique coupons until trials == t, then return the number x out of n tickets collected after t trials. So far, all R has done is stall indefinitely.

Collectathon2 <- function(n, t){
      Coupons <- 1:n
      Collected_Coupons <- c()
      Trials <- 0
      
      while (length(Trials) < t) {
        New_Coupon <- sample(n, 1)
        Collected_Coupons <- unique(c(Collected_Coupons, New_Coupon))
        Unique_Coupons <- length(Collected_Coupons)
        Trials <- Trials + 1
      }
      return(Unique_Coupons)
    }

I assumed I would only need to modify this function below, which will run until it gets all of the tickets. This one works and spits a number out pretty quickly.

Collectathon <- function(n){
      Collected_Coupons <- c()
      Trials <- 0
      
      while (length(Collected_Coupons) < n) {
        New_Coupon <- sample(n, 1)
        Collected_Coupons <- unique(c(Collected_Coupons, New_Coupon))
        Trials <- Trials + 1
        }
      return(Trials)
    }

Solution

  • If the coupons all have an equal probability of being drawn on each draw, then they are indistinguishable, and the only interesting thing is the number of unique coupons (k) after t draws. Once that number is known, one can easily simulate which coupons were collected with sample(k, t).

    A naive function to return the result of n simulations of k (the number of unique coupons collected from a set of N unique coupons after t random uniform draws):

    library(data.table) # for `uniqueN`
    
    rcoupon1 <- function(n, N, t) replicate(n, uniqueN(sample(N, t, 1)))
    

    If many samples of k are desired for the same N and t, it will be more efficient to first calculate the full probability mass function of k and use that to sample k. This CV answer provides a way to efficiently calculate the PMF of k.

    library(copula)
    
    Rcpp::cppFunction("
      NumericVector dcoupon_exact(const int& N, const int& t) {
        int maxk;
        int n1 = N - 1;
        if (N > t) {
          maxk = t;
        } else {
          maxk = N;
        }
        
        NumericVector k (maxk);
        k(0) = 1;
        
        for (int i = 1; i < maxk; i++) {
          for (int j = i; j > 0; j--) {
            k(j) = (k(j - 1)*(N - j) + k(j)*(j + 1))/N;
          }
          k(0) = k(0)/N;
        }
        
        for (int i = maxk; i < t; i++) {
          for (int j = n1; j > 0; j--) {
            k(j) = (k(j - 1)*(N - j) + k(j)*(j + 1))/N;
          }
          k(0) = k(0)/N;
        }
        
        return k;
      }
    ")
    
    dcoupon <- function(N, t, exact = TRUE) {
      l <- min(t, N)
      k <- 1:l
      if (t < 200) {
        logS <- log(Stirling2.all(t)[k])
      } else {
        if (exact) return(dcoupon_exact(N, t))
        # estimate the log Stirling numbers
        v <- t/k
        G <- 1/v
        vexpv <- v/exp(v)
        # Newton-Raphson
        for (i in 1:5) G <- G - (G - (vexpG <- vexpv*exp(G)))/(1 - vexpG)
        logS <- (log(v - 1) - log(v*(1 - G)))/2 +
          (t - k)*(log(v - 1) - log(v - G)) + t*log(k) - k*log(t) + k*(1 - G) +
          lgamma(t + 1) - lgamma(k + 1) - lgamma(t - k + 1)
        if (l == t) logS[t] <- 0
      }
      pmf <- exp(logS + lgamma(N + 1) - lgamma(N - k + 1) - t*log(N))
      pmf/sum(pmf)
    }
    
    rcoupon2 <- function(n, N, t, exact = TRUE) sample(N, n, 1, dcoupon(N, t, exact))
    

    Benchmarking n = 1e4, N = 500, t = 1000:

    set.seed(39634386)
    
    microbenchmark::microbenchmark(
      rcoupon1 = {k1 <- rcoupon1(1e4, 500L, 1e3L)},
      rcoupon2.exact = {k2 <- rcoupon2(1e4, 500L, 1e3L)},
      rcoupon2.approx = {k3 <- rcoupon2(1e4, 500L, 1e3L, FALSE)}
    )
    #> Unit: microseconds
    #>             expr       min         lq        mean     median        uq        max neval
    #>         rcoupon1 1141546.6 1208177.20 1390417.264 1248621.90 1378973.1  4027339.8   100
    #>   rcoupon2.exact    4246.5    4340.90    5442.147    4411.35    5312.8    21677.0   100
    #>  rcoupon2.approx     653.0     689.35    1755.478     712.10     770.2    97984.3   100
    

    Quick check that the distributions of k are the same:

    ks.test(k1, k2)
    #> Warning in ks.test.default(k1, k2): p-value will be approximate in the presence
    #> of ties
    #> 
    #>  Asymptotic two-sample Kolmogorov-Smirnov test
    #> 
    #> data:  k1 and k2
    #> D = 0.0098, p-value = 0.7229
    #> alternative hypothesis: two-sided
    ks.test(k1, k3)
    #> Warning in ks.test.default(k1, k3): p-value will be approximate in the presence
    #> of ties
    #> 
    #>  Asymptotic two-sample Kolmogorov-Smirnov test
    #> 
    #> data:  k1 and k3
    #> D = 0.0072, p-value = 0.9578
    #> alternative hypothesis: two-sided
    

    Timing a larger problem (n = N = t = 1e4):

    system.time(rcoupon1(1e4, 1e4L, 1e4L))
    #>    user  system elapsed 
    #>   10.89    0.51   12.11
    system.time(rcoupon2(1e4, 1e4L, 1e4L))
    #>    user  system elapsed 
    #>    0.61    0.00    0.63
    system.time(rcoupon2(1e4, 1e4L, 1e4L, FALSE))
    #>    user  system elapsed 
    #>    0.02    0.00    0.02
    

    The approximate PMF (which is a very good approximation for large N and t) can easily handle a very large sample of k for a very large coupon collection problem. For example, let's take 1M samples of k for 10M draws from a set of 1M coupons (n = 1e6, N = 1e6, t = 1e7):

    system.time(rcoupon2(1e6, 1e6, 1e7, FALSE))
    #>    user  system elapsed 
    #>    1.18    0.02    1.21
    

    Performing the same calculation using the naive function would take about a week and a half on my (very old) machine:

    system.time(rcoupon1(10, 1e6, 1e7))
    #>    user  system elapsed 
    #>   12.07    0.43    9.71