I have a full dataset of, let's say, 50000 observations which are assigned to 16 classes. I now want to draw a sample of, let's say, 70% of the full data, but I want MATLAB to take the same number of samples from each class (if possible, of course, because some classes have fewer observations than needed).
Is there a MATLAB function that can do this, or do I have to program a new one for myself? I'm just trying to save time here.
I found cvpartition, but as far as I know it can only be used to take a sample with the same class distribution as the original dataset, not a sample distributed uniformly over the classes.
Thank you for your help!
It shouldn't be too hard. Let's say that the observations are in a vector observations. Then you can do
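fraction = 0.7;                                 % share of the full dataset to draw
classes = unique(observations);                 % distinct class values
nObs = length(observations);
nClasses = length(classes);
nSamples = round(nObs * fraction / nClasses);   % number of draws per class
for ii = 1:nClasses
    idx = observations == classes(ii);          % logical mask for class ii
    % fill the ii-th block of nSamples entries with draws from class ii
    samples((ii-1)*nSamples+1:ii*nSamples) = randsample(observations(idx), nSamples);
end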
Now samples is a vector of length nClasses * nSamples that contains your sampled observations, with an equal number from each class.
At the moment it will fail if one of the classes doesn't contain at least nSamples observations. The simplest fix is to pass true as the third argument to randsample, i.e. randsample(observations(idx), nSamples, true), which tells it to sample with replacement, so the same observation can be picked more than once.
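If you would rather avoid duplicates, another option is to cap the number of draws at the size of each class and sample row indices without replacement. Below is a minimal sketch of that idea, not a tested drop-in solution. The names labels, nTarget and sampleIdx are made up for the illustration, and the example data is synthetic (class 1 is deliberately kept small so the cap comes into play).

```matlab
% Synthetic stand-in for the class vector: class 1 has only 100 members,
% classes 2..16 share the remaining 49900 observations.
rng(0);                                    % fix the seed so the run is repeatable
labels = [ones(100, 1); randi([2 16], 49900, 1)];
labels = labels(:);                        % force a column vector

fraction = 0.7;
classes  = unique(labels);
nClasses = numel(classes);
nTarget  = round(numel(labels) * fraction / nClasses);   % desired draws per class

sampleIdx = cell(nClasses, 1);
for ii = 1:nClasses
    classIdx = find(labels == classes(ii));        % rows belonging to class ii
    nTake    = min(nTarget, numel(classIdx));      % cap at the class size
    pick     = randperm(numel(classIdx), nTake);   % nTake distinct positions
    sampleIdx{ii} = classIdx(pick(:));             % map back to row indices
end
sampleIdx = vertcat(sampleIdx{:});         % row indices into the full dataset
```

You can then use sampleIdx to index the rows of your data. The total will fall short of 70% whenever a class is smaller than nTarget, because small classes contribute all of their observations and nothing more.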