Tags: r, dplyr, sampling, statistics-bootstrap, statistical-test

Define sample size using simple random sampling


I am trying to run a PCA, but I have too much data (20k observations) and the resolution is too low. I am using sample_n(df, n, replace = TRUE) [from dplyr] to reduce the size and get a better fit.
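For reference, the step described above would look roughly like this (a hedged sketch: df and n are placeholders from the question, and prcomp assumes all-numeric columns):

    library(dplyr)

    ## Downsample before running PCA, as described in the question
    df_small <- sample_n(df, n, replace = TRUE)   # simple random sample with replacement
    pca <- prcomp(df_small, center = TRUE, scale. = TRUE)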

My question is: what is the best technique to define (or estimate) the sample size (n)? If I have 20k observations (different sites, different times of the year, relatively homogeneous), which cutoff should I use: 5%, 10%, 20%?

Could you give me a reference for your suggestion?

Thank you in advance for your comments.


Solution

  • I would make a loop with different sample sizes; I don't believe there is a clear cutoff the way you could do with train/test (although we have pipelines nowadays, you know what I mean: the 70/30 split). The only thing I would check is whether the sample drawn by sample_n is not too clustered and the values are relatively evenly represented; both ideas are sketched in the code after this answer.

    If you are familiar with k-means clustering, there is the "elbow method", which is a little subjective about where the best number of clusters lies (although we measure the within-cluster sum of squares); you just have to try a lot of iterations and loops.

    You know with neural networks, when you have e.g. a million observations, you can reduce the test set to e.g. 5 or 10 % because in absolute terms you still have a lot of cases.

    In summary: I think it needs a practical test, like the elbow method in clustering, because the answer can be very specific to your data.

    I hope my answer is of at least some value to you; I have no journal reference at the moment.
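    A minimal sketch of the loop idea above, assuming a data frame df with only numeric columns; the candidate sizes and the choice of "variance explained by the first two PCs" as the stability measure are illustrative, not prescriptive:

        library(dplyr)

        set.seed(42)
        sizes <- c(500, 1000, 2000, 5000, 10000, 20000)

        ## For each candidate sample size, draw a simple random sample,
        ## run PCA, and record how much variance the first two PCs explain.
        pc12_var <- sapply(sizes, function(n) {
          samp <- sample_n(df, n, replace = TRUE)
          pca  <- prcomp(samp, center = TRUE, scale. = TRUE)
          var_explained <- pca$sdev^2 / sum(pca$sdev^2)
          sum(var_explained[1:2])
        })

        ## Elbow-style plot: the point where the curve flattens suggests a
        ## sample size beyond which the PCA result barely changes.
        plot(sizes, pc12_var, type = "b",
             xlab = "Sample size", ylab = "Variance explained by PC1 + PC2")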
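    And a quick check of whether the values are "relatively evenly represented", comparing the distribution of a grouping column between the full data and the sample; the column name site is hypothetical:

        library(dplyr)

        samp <- sample_n(df, 2000, replace = TRUE)

        ## Share of each site in the full data vs. in the sample
        full_prop <- df   %>% count(site) %>% mutate(prop = n / sum(n))
        samp_prop <- samp %>% count(site) %>% mutate(prop = n / sum(n))

        ## The two prop columns should be close if the sample is not too clustered.
        left_join(full_prop, samp_prop, by = "site", suffix = c("_full", "_sample"))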