I have a set of multivariate instances and I need to extract a representative set from these instances; for instance if I have 100,000 multivariate instances, I want to extract 1000 instances that would be representative of the original distribution. I used Latin Hypercube Sampling and Random Sampling to extract two representative sets and now I want to check how much of a correlation these two representative sets have with the original set.
If I further elaborate;
I have 100,000 multivariate instances (let's call it A)
I derive two representative samples from 'A' (each set will have 1000 instances; let's call these two sets B and C)
I want to check whether 'B' and 'C' preserves the distribution of the original 'A'.
Thanks a lot in advance!
This is more of a statistics question, but here's an outline. Normally you'd use a Chi-squared test to compare the distributions. The basic steps are as follows.
Bin each of the data sets. Try to set up the bins so that there's at least 5 or more samples in each bin. (Use the same bins for all data sets).
Use the large sample "A" to determine the expected number of samples (call it f_e) in each bin. (BTW. Note that f_e for any particular bin would be 1/100 of the number samples in that particular bin, since sample A contains 100 times the data points of B or C).
To test one of the samples (say B) calculate the sum: S = sum over all bins of (f_o - f_e)^2/fe, where f_o is the observed frequency in the bin.
This sum is a Chi-squared variable with degrees of freedom one less than the total number of bins that you are using.
Calculate 1 - chi2cdf(S,dof). This is the probability that a sum as large or larger than the one you obtained (S), could have happened purely due to random variations (that is, even if the distribution were identical). So a small result (close to 0) means that the distribution are likely to be different, and a large result (close to 1) means they're not likely to be significantly different.
There's probably a library function to do all of the above. IDK, as I haven't used any statistics libraries for a long while.