MATLAB: Taking sample with same number of values from each class


I have a full dataset of, let's say, 50000 observations which are assigned to 16 classes. I now want to draw a sample of, let's say, 70% of the full data, but I want MATLAB to take the same number of samples from each class (if possible, of course, because some classes have fewer observations than needed).

Is there a MATLAB function that can do this, or do I have to program a new one for myself? I'm just trying to save time here.

I found cvpartition, but as far as I know this can be used only to take a sample with the same distribution over the classes as the original dataset and not a uniformly distributed sample.

Thank you for your help!


Solution

  • It shouldn't be too hard. Let's say that the class label of each observation is stored in a vector observations. Then you can do

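    fraction = 0.7;                     % share of the full data to sample
    
    classes = unique(observations);     % distinct class labels
    nObs = length(observations);
    nClasses = length(classes);
    nSamples = round(nObs * fraction / nClasses);   % samples per class
    
    for ii = 1:nClasses
        % logical mask selecting the observations of the ii-th class
        idx = observations == classes(ii);
        % draw nSamples of them (without replacement) into the ii-th block
        samples((ii-1)*nSamples+1:ii*nSamples) = randsample(observations(idx), nSamples);
    end
    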

    Now samples is a vector of length nClasses * nSamples that contains your sampled observations, with an equal number from each class.

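    At the moment it will fail if one of the classes doesn't contain at least nSamples observations. The simplest fix is to pass true as the third argument to randsample, i.e. randsample(observations(idx), nSamples, true), which tells it to sample with replacement. Each observation is then put back after being picked, so observations from a small class can appear more than once in the sample.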
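
    If duplicates are not acceptable, one alternative is to cap the per-class count at the class size and to sample row indices rather than values. The following is only a minimal sketch of that idea, not a built-in function: the labels vector and its made-up contents, and the names nPerClass, members and sampleIdx, are assumptions for illustration. It uses randperm instead of randsample, which also sidesteps randsample treating a single-element population as the integer range 1:n.

    ```matlab
    % Stand-in data (an assumption, not from the question):
    % 15 large classes plus one small class with only 100 members.
    rng(0);                                         % fixed seed, for reproducibility only
    labels = [randi(15, 49900, 1); 16 * ones(100, 1)];

    fraction  = 0.7;
    classes   = unique(labels);
    nClasses  = numel(classes);
    nPerClass = round(numel(labels) * fraction / nClasses);   % target per class

    sampleIdx = [];
    for ii = 1:nClasses
        members = find(labels == classes(ii));      % row indices of this class
        k = min(nPerClass, numel(members));         % cap at the class size
        pick = members(randperm(numel(members), k));% k distinct members, no replacement
        sampleIdx = [sampleIdx; pick];              %#ok<AGROW>
    end

    sampledLabels = labels(sampleIdx);              % class label of each sampled row
    ```

    sampleIdx can then be used to index the actual data, e.g. X(sampleIdx, :) for a hypothetical matrix X with one row per observation. Note that whenever a class is capped, the total sample ends up smaller than the requested fraction.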