I am trying to run a PCA, but I have too much data (20k observations) and the resolution is too low. I am using sample_n(df, size = n, replace = TRUE) [from dplyr] to reduce the size and get a better fit.
My question is: what is the best technique to define (or estimate) the sample size n? Given 20k observations (different sites, different times of the year, relatively homogeneous), which cutoff should I use: 5%, 10%, or 20%?
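Roughly, this is what I am doing at the moment (a minimal sketch; df stands for my data frame of numeric variables, and I am assuming base R's prcomp for the PCA itself):

```r
library(dplyr)

n <- 2000                                  # e.g. 10% of the 20k observations
df_sub <- sample_n(df, size = n, replace = TRUE)

# PCA on the subsample (prcomp assumed; center and scale the variables)
pca <- prcomp(df_sub, center = TRUE, scale. = TRUE)
summary(pca)                               # variance explained per component
```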
Could you give me a reference for your suggestion?
Thank you in advance for your comments.
I would make a loop with different sample sizes; I don't believe there is a clear cutoff the way there is with train/test splits (although we have pipelines now, but you know what I mean: the 70/30 rule of thumb). The only thing I would check is whether the sample_n output is not too clustered and the values are relatively equally represented.
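A rough sketch of what I mean (I am assuming a hypothetical grouping column called site in df; adjust it to whatever identifies your sites or times): loop over a few candidate fractions and check how far the group proportions in the subsample drift from the full data.

```r
library(dplyr)

# Sketch: compare group proportions in each subsample against the full data,
# to see whether the sample is "too clustered".
fractions <- c(0.05, 0.10, 0.20)

full_prop <- df %>% count(site) %>% mutate(prop = n / sum(n))

for (f in fractions) {
  smp      <- sample_n(df, size = round(f * nrow(df)), replace = TRUE)
  smp_prop <- smp %>% count(site) %>% mutate(prop = n / sum(n))

  cmp <- full_prop %>%
    left_join(smp_prop, by = "site", suffix = c("_full", "_smp")) %>%
    mutate(prop_smp = coalesce(prop_smp, 0))

  # Small values mean the sites are still roughly equally represented.
  cat(sprintf("fraction %.2f: max group-proportion difference = %.3f\n",
              f, max(abs(cmp$prop_full - cmp$prop_smp))))
}
```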
If you are familiar with k-means clustering, there is the "elbow method", which is somewhat subjective about where the best number of clusters lies (even though we measure the within-cluster sum of squares); you just have to try a lot of iterations and loops.
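Applied to your sample-size question, the same idea would look roughly like this (a sketch under assumptions: I keep only the numeric columns, use the cumulative variance explained by the first 3 PCs as the quantity to stabilise, and compare each subsample's PCA against one PCA on the full 20k rows):

```r
# Elbow-style check: how close is each subsample's PCA to the full-data PCA?
fractions <- c(0.01, 0.02, 0.05, 0.10, 0.20, 0.50)

num_vars <- df[sapply(df, is.numeric)]           # numeric columns only
full_pca <- prcomp(num_vars, center = TRUE, scale. = TRUE)
full_cum <- summary(full_pca)$importance["Cumulative Proportion", 3]

gap <- sapply(fractions, function(f) {
  smp <- dplyr::sample_n(num_vars, size = round(f * nrow(num_vars)), replace = TRUE)
  p   <- prcomp(smp, center = TRUE, scale. = TRUE)
  abs(summary(p)$importance["Cumulative Proportion", 3] - full_cum)
})

# Plot the discrepancy against the sample fraction and look for where it
# flattens out, just like the elbow plot in k-means.
plot(fractions, gap, type = "b",
     xlab = "sample fraction", ylab = "|difference in cumulative variance, 3 PCs|")
```

Once the curve is flat, a larger sample only costs computation without changing the picture much, so I would pick the fraction at the elbow.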
You know, with neural networks, when you have e.g. a million observations you can shrink the test set to e.g. 5 or 10% because in absolute terms you still have plenty of cases.
In summary: I think this needs a practical test, like the elbow method in clustering, because the answer can be very specific to your data.
I hope my answer is of at least some value to you; I have no journal reference at the moment.