Is there a recommended number of simulated datasets (B) from the reference distribution when computing Tibshirani's Gap statistic? B=50? B=100? B=500? B=1000? If so, any good reference that mentions it?
If we go back to the original publication [Tibshirani, Walther and Hastie, J. R. Statist. Soc. B 63, 411 (2001)], the authors define the "1-standard-error" rule to determine the optimum number of clusters as the smallest k with

$$\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1},$$
where $s_k$ is the MC-simulation-corrected standard error

$$s_k = \sqrt{1 + 1/B}\;\mathrm{sd}(k)$$

for $B$ copies of MC samples drawn from the reference distribution, with $\mathrm{sd}(k)$ the standard deviation of the $B$ simulated values of $\log W_k^*$. In the latter equation the square-root term estimates the inflation of the standard deviation due to the finite number of MC samples, and we obviously have

$$\sqrt{1 + 1/B} \to 1 \quad \text{for } B \to \infty.$$
For example, for $B = 10$, the standard deviation $s_k$ is increased by about 5% due to the MC sampling uncertainty. If you choose $B = 100$, the increase is only about 0.5%.
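As a quick numerical check of this correction factor (a minimal sketch; the particular values of B below are just examples):

```python
import numpy as np

# Inflation of s_k relative to sd(k) caused by using a finite number B
# of Monte Carlo reference datasets: sqrt(1 + 1/B)
for B in (10, 50, 100, 500, 1000):
    inflation = np.sqrt(1.0 + 1.0 / B)
    print(f"B = {B:4d}: s_k / sd(k) = {inflation:.4f} "
          f"({100 * (inflation - 1):.2f}% increase)")
```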
I imagine that in practical terms $B = 10$ would probably suffice for a lot of applications. But this requires some trial-and-error evaluation of the gap statistic and its standard deviation, based on your actual data and its underlying cluster structure (e.g. the number of well-separated vs. less well-separated clusters); a minimal sketch of such an evaluation is given below.
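For the trial-and-error part, here is a minimal sketch of a gap-statistic computation (assuming scikit-learn's KMeans and the simpler uniform bounding-box reference distribution; the function names, the default k_max and the choice of B are illustrative, not prescribed by the paper):

```python
import numpy as np
from sklearn.cluster import KMeans


def log_wk(X, k, random_state=0):
    """log of the within-cluster sum of squares W_k for a k-means fit."""
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
    return np.log(km.inertia_)


def gap_statistic(X, k_max=10, B=10, random_state=0):
    """Gap(k) and s_k for k = 1..k_max, using B uniform reference datasets."""
    rng = np.random.default_rng(random_state)
    lo, hi = X.min(axis=0), X.max(axis=0)  # bounding box of the data

    gaps, s = np.zeros(k_max), np.zeros(k_max)
    for k in range(1, k_max + 1):
        log_wk_obs = log_wk(X, k, random_state)
        # B Monte Carlo copies drawn uniformly from the bounding box
        log_wk_ref = np.array([
            log_wk(rng.uniform(lo, hi, size=X.shape), k, random_state)
            for _ in range(B)
        ])
        gaps[k - 1] = log_wk_ref.mean() - log_wk_obs
        # s_k = sd(k) * sqrt(1 + 1/B)
        s[k - 1] = log_wk_ref.std(ddof=0) * np.sqrt(1.0 + 1.0 / B)
    return gaps, s


def choose_k(gaps, s):
    """1-standard-error rule: smallest k with Gap(k) >= Gap(k+1) - s_{k+1}."""
    for k in range(1, len(gaps)):
        if gaps[k - 1] >= gaps[k] - s[k]:
            return k
    return len(gaps)
```

Running `gap_statistic` on your data with, say, B = 10 and again with B = 100, and comparing the selected k and the size of the s_k values, is one way to judge whether a small B is already sufficient for your cluster structure.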
Some useful references (in no particular order):
Cross Validated: How should I interpret GAP statistic
The Data Science Lab: Finding the K in K-Means Clustering
Tibshirani, Walther and Hastie, J. R. Statist. Soc. B 63, 411 (2001)