Tags: matlab, multivariate-testing, multivariate-partition

Comparing multivariate distributions


I have a set of multivariate instances and I need to extract a representative subset from them; for example, if I have 100,000 multivariate instances, I want to extract 1,000 instances that are representative of the original distribution. I used Latin Hypercube Sampling and Random Sampling to extract two representative sets, and now I want to check how well these two representative sets agree with the distribution of the original set.

To elaborate further:

I have 100,000 multivariate instances (let's call it A)

I derive two representative samples from 'A' (each set will have 1000 instances; let's call these two sets B and C)

I want to check whether 'B' and 'C' preserve the distribution of the original 'A'.
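For concreteness, here is a sketch of how one of the two subsets (say the randomly sampled one, 'C') might be drawn, assuming 'A' is stored as a 100,000-by-d numeric matrix with one instance per row (the variable names are my own, and the Latin Hypercube selection is not shown):

    idx = randperm(size(A, 1), 1000);   % 1000 distinct row indices chosen at random
    C   = A(idx, :);                    % simple random subsample of A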

Thanks a lot in advance!


Solution

  • This is more of a statistics question, but here's an outline. Normally you'd use a Chi-squared test to compare the distributions. The basic steps are as follows; a rough MATLAB sketch of the whole procedure is given after the steps.

    1. Bin each of the data sets, using the same bins for every set. Try to set up the bins so that each bin contains at least 5 samples.

    2. Use the large sample "A" to determine the expected number of samples (call it f_e) in each bin. (Note that f_e for any particular bin would be 1/100 of the number of samples A has in that bin, since A contains 100 times as many data points as B or C.)

    3. To test one of the samples (say B), calculate the sum S = sum over all bins of (f_o - f_e)^2 / f_e, where f_o is the observed frequency in the bin.

    4. This sum is a Chi-squared variable with degrees of freedom one less than the total number of bins that you are using.

    5. Calculate 1 - chi2cdf(S,dof). This is the probability that a sum as large as or larger than the one you obtained (S) could have arisen purely from random variation (that is, even if the distributions were identical). So a small result (close to 0) means that the distributions are likely to be different, and a large result (close to 1) means they are not likely to be significantly different.

    There's probably a library function to do all of the above. IDK, as I haven't used any statistics libraries for a long while.
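    Here is a rough MATLAB sketch of the steps above (an illustration under assumptions, not a drop-in implementation). It assumes A is 100,000-by-d and B is 1,000-by-d with one instance per row and no NaNs, bins each dimension at quantiles of A so that the same edges are used for every data set, and combines the per-dimension bin indices into one joint bin per instance. This only stays practical for a small number of dimensions, since the number of joint bins grows as nBinsPerDim^d. chi2cdf requires the Statistics and Machine Learning Toolbox; all other names are my own.

        nBinsPerDim = 4;                 % keep small so the joint bins stay populated
        d      = size(A, 2);
        nJoint = nBinsPerDim ^ d;

        binsA = ones(size(A, 1), 1);     % joint bin index of each instance
        binsB = ones(size(B, 1), 1);
        for j = 1:d
            % Step 1: common bin edges for all data sets, taken from the large sample A
            q = quantile(A(:, j), (1:nBinsPerDim-1)/nBinsPerDim);
            edges = [-inf, q(:).', inf];
            [~, ~, bA] = histcounts(A(:, j), edges);
            [~, ~, bB] = histcounts(B(:, j), edges);
            binsA = binsA + (bA - 1) * nBinsPerDim^(j - 1);
            binsB = binsB + (bB - 1) * nBinsPerDim^(j - 1);
        end

        countsA = accumarray(binsA, 1, [nJoint, 1]);   % observed counts per joint bin
        countsB = accumarray(binsB, 1, [nJoint, 1]);

        % Step 2: expected counts for B, scaled down from A (1000/100000 = 1/100 here)
        f_e = countsA * (size(B, 1) / size(A, 1));

        % Keep only bins with a reasonable expected count (the "at least 5" rule)
        keep = f_e >= 5;
        f_o  = countsB(keep);

        % Steps 3-5: chi-squared statistic, degrees of freedom, and p-value
        S   = sum((f_o - f_e(keep)).^2 ./ f_e(keep));
        dof = nnz(keep) - 1;
        p   = 1 - chi2cdf(S, dof);       % small p -> B likely differs from A

    Running the same lines with C in place of B gives the second comparison, and the two p-values indicate which subset preserves the distribution of A better.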