Is there a recommended number of simulated datasets (B) from the reference distribution when computing Tibshirani's Gap statistic? B=50? B=100? B=500? B=1000? If so, any good reference that mentions it?
If we go back to the original publication [Tibshirani, Walther and Hastie, J. R. Statist. Soc. B 63, 411 (2001)], the authors define the "1-standard-error" rule to determine the optimum number of clusters as the smallest k with

$$\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1},$$
where $s_k$ is the MC-simulation-corrected standard error

$$s_k = \sqrt{1 + 1/B}\;\mathrm{sd}(k)$$

for $B$ copies of MC samples drawn from the reference distribution, with $\mathrm{sd}(k)$ the standard deviation of the $B$ simulated values of $\log W_k^*$. In the latter equation the square-root term estimates the inflation of the standard deviation due to the finite number of MC samples, and we obviously have

$$\sqrt{1 + 1/B} \to 1 \quad \text{for } B \to \infty.$$
For example, for $B = 10$, the standard deviation $s_k$ is increased by about 5% due to the MC sampling uncertainty. If you choose $B = 100$, the increase is only about 0.5%.
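As a quick numerical check of this correction factor (a minimal sketch; the particular values of B below are just examples):

```python
import numpy as np

# Inflation of s_k relative to sd(k) caused by using a finite number B
# of Monte Carlo reference datasets: sqrt(1 + 1/B)
for B in (10, 50, 100, 500, 1000):
    inflation = np.sqrt(1.0 + 1.0 / B)
    print(f"B = {B:4d}: s_k / sd(k) = {inflation:.4f} "
          f"({100 * (inflation - 1):.2f}% increase)")
```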
I imagine that in practical terms $B = 10$ would probably suffice for a lot of applications. But this requires some trial-and-error evaluation of the gap statistic and its standard deviation, based on your actual data and its underlying cluster structure (e.g. the number of well-separated vs. less well-separated clusters); a minimal sketch of such an evaluation is given below.
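For the trial-and-error part, here is a minimal sketch of a gap-statistic computation (assuming scikit-learn's KMeans and the simpler uniform bounding-box reference distribution; the function names, the default k_max and the choice of B are illustrative, not prescribed by the paper):

```python
import numpy as np
from sklearn.cluster import KMeans


def log_wk(X, k, random_state=0):
    """log of the within-cluster sum of squares W_k for a k-means fit."""
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
    return np.log(km.inertia_)


def gap_statistic(X, k_max=10, B=10, random_state=0):
    """Gap(k) and s_k for k = 1..k_max, using B uniform reference datasets."""
    rng = np.random.default_rng(random_state)
    lo, hi = X.min(axis=0), X.max(axis=0)  # bounding box of the data

    gaps, s = np.zeros(k_max), np.zeros(k_max)
    for k in range(1, k_max + 1):
        log_wk_obs = log_wk(X, k, random_state)
        # B Monte Carlo copies drawn uniformly from the bounding box
        log_wk_ref = np.array([
            log_wk(rng.uniform(lo, hi, size=X.shape), k, random_state)
            for _ in range(B)
        ])
        gaps[k - 1] = log_wk_ref.mean() - log_wk_obs
        # s_k = sd(k) * sqrt(1 + 1/B)
        s[k - 1] = log_wk_ref.std(ddof=0) * np.sqrt(1.0 + 1.0 / B)
    return gaps, s


def choose_k(gaps, s):
    """1-standard-error rule: smallest k with Gap(k) >= Gap(k+1) - s_{k+1}."""
    for k in range(1, len(gaps)):
        if gaps[k - 1] >= gaps[k] - s[k]:
            return k
    return len(gaps)
```

Running `gap_statistic` on your data with, say, B = 10 and again with B = 100, and comparing the selected k and the size of the s_k values, is one way to judge whether a small B is already sufficient for your cluster structure.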
Some useful references (in no particular order):
Cross Validated: How should I interpret GAP statistic
The Data Science Lab: Finding the K in K-Means Clustering
Tibshirani, Walther and Hastie, J. R. Statist. Soc. B 63, 411 (2001)