rfunctionstatisticsdata-sciencestatistics-bootstrap# R Function For Coupon Collector's Function with Finite Number of Trials

I'm trying to write up a function that runs the Coupon Collector's Problem in R. I'm trying to make a function that will spit out the number of unique coupons collected from sample size n with a finite number of trials t. (For example, having 500 unique coupons in a sample space, and picking a coupon from it 1000 times with replacement.)

I was able to write a similar function that returns how many times you have to sample it to collect ALL tickets. However, that function only requires specifying n, and does not account for the finite number of trials.

So far, this is how I have attempted to go about writing this function, with the function calling for n and t.

I assumed that if the length of the trials was less than t for the while loop, I could keep counting the unique coupons until trials == t, then return the number x out of n tickets collected after t trials. So far, all R has done is stall indefinitely.

```
Collectathon2 <- function(n, t){
Coupons <- 1:n
Collected_Coupons <- c()
Trials <- 0
while (length(Trials) < t) {
New_Coupon <- sample(n, 1)
Collected_Coupons <- unique(c(Collected_Coupons, New_Coupon))
Unique_Coupons <- length(Collected_Coupons)
Trials <- Trials + 1
}
return(Unique_Coupons)
}
```

I assumed I would only need to modify this function below, which will run until it gets all of the tickets. This one works and spits a number out pretty quickly.

```
Collectathon <- function(n){
Collected_Coupons <- c()
Trials <- 0
while (length(Collected_Coupons) < n) {
New_Coupon <- sample(n, 1)
Collected_Coupons <- unique(c(Collected_Coupons, New_Coupon))
Trials <- Trials + 1
}
return(Trials)
}
```

Solution

If the coupons all have an equal probability of being drawn on each draw, then they are indistinguishable, and the only interesting thing is the number of unique coupons (`k`

) after `t`

draws. Once that number is known, one can easily simulate *which* coupons were collected with `sample(k, t)`

.

A naive function to return the result of `n`

simulations of `k`

(the number of unique coupons collected from a set of `N`

unique coupons after `t`

random uniform draws):

```
library(data.table) # for `uniqueN`
rcoupon1 <- function(n, N, t) replicate(n, uniqueN(sample(N, t, 1)))
```

If many samples of `k`

are desired for the same `N`

and `t`

, it will be more efficient to first calculate the full probability mass function of `k`

and use that to sample `k`

. This CV answer provides a way to efficiently calculate the PMF of `k`

.

```
library(copula)
Rcpp::cppFunction("
NumericVector dcoupon_exact(const int& N, const int& t) {
int maxk;
int n1 = N - 1;
if (N > t) {
maxk = t;
} else {
maxk = N;
}
NumericVector k (maxk);
k(0) = 1;
for (int i = 1; i < maxk; i++) {
for (int j = i; j > 0; j--) {
k(j) = (k(j - 1)*(N - j) + k(j)*(j + 1))/N;
}
k(0) = k(0)/N;
}
for (int i = maxk; i < t; i++) {
for (int j = n1; j > 0; j--) {
k(j) = (k(j - 1)*(N - j) + k(j)*(j + 1))/N;
}
k(0) = k(0)/N;
}
return k;
}
")
dcoupon <- function(N, t, exact = TRUE) {
l <- min(t, N)
k <- 1:l
if (t < 200) {
logS <- log(Stirling2.all(t)[k])
} else {
if (exact) return(dcoupon_exact(N, t))
# estimate the log Stirling numbers
v <- t/k
G <- 1/v
vexpv <- v/exp(v)
# Newton-Raphson
for (i in 1:5) G <- G - (G - (vexpG <- vexpv*exp(G)))/(1 - vexpG)
logS <- (log(v - 1) - log(v*(1 - G)))/2 +
(t - k)*(log(v - 1) - log(v - G)) + t*log(k) - k*log(t) + k*(1 - G) +
lgamma(t + 1) - lgamma(k + 1) - lgamma(t - k + 1)
if (l == t) logS[t] <- 0
}
pmf <- exp(logS + lgamma(N + 1) - lgamma(N - k + 1) - t*log(N))
pmf/sum(pmf)
}
rcoupon2 <- function(n, N, t, exact = TRUE) sample(N, n, 1, dcoupon(N, t, exact))
```

Benchmarking `n = 1e4`

, `N = 500`

, `t = 1000`

:

```
set.seed(39634386)
microbenchmark::microbenchmark(
rcoupon1 = {k1 <- rcoupon1(1e4, 500L, 1e3L)},
rcoupon2.exact = {k2 <- rcoupon2(1e4, 500L, 1e3L)},
rcoupon2.approx = {k3 <- rcoupon2(1e4, 500L, 1e3L, FALSE)}
)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> rcoupon1 1141546.6 1208177.20 1390417.264 1248621.90 1378973.1 4027339.8 100
#> rcoupon2.exact 4246.5 4340.90 5442.147 4411.35 5312.8 21677.0 100
#> rcoupon2.approx 653.0 689.35 1755.478 712.10 770.2 97984.3 100
```

Quick check that the distributions of `k`

are the same:

```
ks.test(k1, k2)
#> Warning in ks.test.default(k1, k2): p-value will be approximate in the presence
#> of ties
#>
#> Asymptotic two-sample Kolmogorov-Smirnov test
#>
#> data: k1 and k2
#> D = 0.0098, p-value = 0.7229
#> alternative hypothesis: two-sided
ks.test(k1, k3)
#> Warning in ks.test.default(k1, k3): p-value will be approximate in the presence
#> of ties
#>
#> Asymptotic two-sample Kolmogorov-Smirnov test
#>
#> data: k1 and k3
#> D = 0.0072, p-value = 0.9578
#> alternative hypothesis: two-sided
```

Timing a larger problem (`n = N = t = 1e4`

):

```
system.time(rcoupon1(1e4, 1e4L, 1e4L))
#> user system elapsed
#> 10.89 0.51 12.11
system.time(rcoupon2(1e4, 1e4L, 1e4L))
#> user system elapsed
#> 0.61 0.00 0.63
system.time(rcoupon2(1e4, 1e4L, 1e4L, FALSE))
#> user system elapsed
#> 0.02 0.00 0.02
```

The approximate PMF (which is a *very* good approximation for large `N`

and `t`

) can easily handle a very large sample of `k`

for a very large coupon collection problem. For example, let's take 1M samples of `k`

for 10M draws from a set of 1M coupons (`n = 1e6`

, `N = 1e6`

, `t = 1e7`

):

```
system.time(rcoupon2(1e6, 1e6, 1e7, FALSE))
#> user system elapsed
#> 1.18 0.02 1.21
```

Performing the same calculation using the naive function would take about a week and a half on my (very old) machine:

```
system.time(rcoupon1(10, 1e6, 1e7))
#> user system elapsed
#> 12.07 0.43 9.71
```

- Installing R on Linux: configure: error: libcurl >= 7.28.0 library and headers are required with support for https
- How to do ensembles with time series using AICc?
- planes3d expands and draws the area based on the sphere's radius
- How to extract tag code itself using R, rvest
- How to Display or Print Contents of Environment in R
- How to use Windows user credentials for proxy authentication in R/RStudio
- R reticulate specifying python executable to use
- Replace multiple Instances of a variable name in an R function and save the modified function
- Standardizing address formatting in R
- How to fix "failed to load cairo DLL" in R?
- Using grepl to filter columns names in specific range of columns
- changing the legends in ggplot2 to have groups of similar labels
- How to keep only unique rows but ignore a column?
- convert string date to R Date FAST for all dates
- Add subgroup text to plotly pie chart
- R Shiny : adjust height of DT datatable when fillContainer=TRUE,
- Why do R external pointers' "unusual copying semantics" mean they should not be used stand-alone?
- How to extract somo character after a string with a number of word which can change in R
- What does `se` stand for in geom_smooth(..., se = FALSE)?
- How to find number of rows greater than any values in R
- Align text and reduce space between text and parentheses in plotly hover info box
- Remove outer box of geom_bar plot with broken y-axis
- How to use lag/lead in mutate with an initial value?
- Is it possible to have a Shiny ConditionalPanel whose condition is a global variable?
- counting elements in one list in another list
- How to vectorize nested loops in R?
- Replace NA values with an incrementing sequence starting from the previous non-NA value
- How can I calculate the number of uniques in a row within a species matrix?
- How to perform operations on pairs of rows, based on a "distinguishing" column's values
- Mutate variable based on previous observations