
K-means clustering and nsamples for KernelExplainer


I have a dataset with roughly 50,000 observations. I want to compute Shapley values using KernelExplainer after fitting an ElasticNet regression. Is there a reference or rule of thumb for choosing K and nsamples? Thank you very much.

I tried K=10 and nsamples=100, but the plot of Shapley values for each feature is usually an upward- or downward-sloping line. On some occasions, there are only two points in the plot.


Solution

  • You would typically use K = 20-100 centroids as background data.

    A good value for nsamples is of the order $p(p+1) + 200$, where $p$ is the number of features. When the sample budget allows, KernelExplainer enumerates the most heavily weighted on-off (masking) combinations first: those that switch one or two features on or off, of which there are $2p + p(p-1) = p(p+1)$ (for $p \ge 5$). The 200 additional on-off samples cover the less important part of the KernelSHAP distribution.
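The rule of thumb above can be sketched in Python as follows. This is a minimal illustration, not a definitive recipe: the helper names (`suggested_nsamples`, `small_and_large_coalitions`, `explain_elastic_net`), the synthetic data, the `alpha` value, and the choice to explain only the first 100 rows are all assumptions made for the example. The `shap` and scikit-learn imports sit inside the last function so the two counting helpers can be run on their own.

```python
from math import comb


def suggested_nsamples(p: int, extra: int = 200) -> int:
    """Rule of thumb from the answer: p(p+1) + 200 masking samples."""
    return p * (p + 1) + extra


def small_and_large_coalitions(p: int) -> int:
    """Count masks with 1, 2, p-2 or p-1 features switched on (valid for p >= 5)."""
    return 2 * (comb(p, 1) + comb(p, 2))


def explain_elastic_net(n_rows=2000, p=8, k=50, n_explain=100, seed=0):
    """Hypothetical end-to-end sketch; requires shap and scikit-learn."""
    import shap
    from sklearn.datasets import make_regression
    from sklearn.linear_model import ElasticNet

    # Synthetic stand-in for the real dataset (assumption for illustration).
    X, y = make_regression(n_samples=n_rows, n_features=p, noise=0.1,
                           random_state=seed)
    model = ElasticNet(alpha=0.1).fit(X, y)

    # Summarise the data with k weighted k-means centroids (K in the 20-100 range).
    background = shap.kmeans(X, k)
    explainer = shap.KernelExplainer(model.predict, background)

    # Explain a subset of rows; KernelExplainer's cost grows with rows * nsamples.
    return explainer.shap_values(X[:n_explain], nsamples=suggested_nsamples(p))
```

For example, with $p = 10$ features, `suggested_nsamples(10)` gives 310, of which `small_and_large_coalitions(10)` accounts for $10 \cdot 11 = 110$. Calling `explain_elastic_net()` returns an array of SHAP values with one row per explained observation and one column per feature.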