Image classification models trained on animal classification datasets such as iNaturalist or iWildCam sometimes develop spurious correlations with the image background. How can we measure the performance limitations caused specifically by such spurious correlations, as opposed to other plausible (non-spurious) sources of error (e.g., two animal species that genuinely look alike)?
Google [1] defines in-distribution robustness as a model's performance on a held-out test set drawn from the same data distribution, while out-of-distribution (OOD) robustness, which is the focus of this question, is the model's performance when classifying the same objects in a different dataset. The benchmarks Google used to demonstrate their state-of-the-art model "ViT-Plex" were CIFAR-10 vs CIFAR-100 [2], CIFAR-100 vs CIFAR-10, ImageNet vs Places365, and RETINA. Papers with Code also lists several other benchmark datasets for OOD detection [3].
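To make this concrete, here is a minimal sketch of how such an OOD-detection benchmark is typically scored, using the common maximum-softmax-probability (MSP) baseline and AUROC. This is not the specific evaluation code from the Plex paper; `model`, `id_loader`, and `ood_loader` are assumed placeholders for your trained classifier and DataLoaders over the in-distribution (e.g., CIFAR-10) and OOD (e.g., CIFAR-100) test sets.

```python
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score

@torch.no_grad()
def msp_scores(model, loader, device="cpu"):
    """Return the max-softmax-probability confidence for every image."""
    model.eval()
    scores = []
    for images, _ in loader:
        logits = model(images.to(device))
        scores.append(F.softmax(logits, dim=1).max(dim=1).values.cpu())
    return torch.cat(scores).numpy()

# `model`, `id_loader`, `ood_loader` are assumed to exist (not shown here):
# a trained classifier plus DataLoaders over ID and OOD test sets.
id_scores = msp_scores(model, id_loader)
ood_scores = msp_scores(model, ood_loader)

# Label ID samples 1 and OOD samples 0; a well-calibrated model should
# assign higher confidence to ID images, pushing AUROC toward 1.0.
labels = np.concatenate([np.ones_like(id_scores), np.zeros_like(ood_scores)])
scores = np.concatenate([id_scores, ood_scores])
print(f"OOD-detection AUROC: {roc_auc_score(labels, scores):.3f}")
```

A high AUROC here means the model's confidence separates in-distribution from out-of-distribution inputs well, which is exactly what the leaderboards in [2] and [3] rank.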
[1] https://ai.googleblog.com/2022/07/towards-reliability-in-deep-learning.html
[2] https://paperswithcode.com/sota/out-of-distribution-detection-on-cifar-100-vs
[3] https://paperswithcode.com/task/out-of-distribution-detection