machine-learning classification sampling multilabel-classification azure-machine-learning-service

Azure machine learning even sampling

I'm trying to do some basic multi-label classification in Azure ML. I have some basic data in the following format:

value_x value_y label
x1      y1      label1
x2      y2      label1
x3      y3      label2
.....

My problem is that certain labels (out of five in total) are overrepresented in my data: about 40% of the rows are label1, about 20% are label2, and the remaining labels are around 10% each.

I would like to get a sampling out of these to train my model, so that each label is represented in equal amounts.

I tried the stratification option in the Sampling module on the label column, but that just gives me a sample with the same distribution of labels as the initial dataset.

Any idea how I could do this with a module?

Solution

I was able to do this using a combination of the Split Data, Partition and Sample, and Add Rows modules. There may be an easier way to do it, but I did confirm it works. :) I published my work at http://gallery.azureml.net/Details/1245147fd7004e91bc7a3683cda19cc7 so you can grab it directly from there and run it to confirm it does what you expect.

Since you said you wanted a sampling of the data, I reduced each label to about 10% of the original dataset so that all labels are represented equally. Since you have a good understanding of the distribution in your dataset, leave label3, label4, and label5 as they are at about 10% each, and sample label1 down to about 1/4 of its rows and label2 down to about 1/2 of its rows, so that each of them also ends up at about 10%.
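
As a rough illustration of that arithmetic, here is a minimal Python sketch (not part of the Azure ML experiment). The shares are the approximate figures from the question, and the function name `sampling_rates` is just an assumed name for this example. The fraction of rows to keep for each label is the smallest share divided by that label's share.

```python
# Approximate label shares quoted in the question (rough figures, not measured values).
label_share = {
    "label1": 0.40,
    "label2": 0.20,
    "label3": 0.10,
    "label4": 0.10,
    "label5": 0.10,
}


def sampling_rates(shares):
    """Return the fraction of each label's rows to keep so that every
    label ends up about as large as the rarest one."""
    target = min(shares.values())
    return {label: target / share for label, share in shares.items()}


rates = sampling_rates(label_share)
for label, rate in rates.items():
    print(f"{label}: keep {rate:.0%} of its rows")
```

These rates would be the values to enter as the sampling rate in the corresponding Partition and Sample modules.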

To explain what I did in the workspace linked above:

I used some "Split Data" modules to filter out the label1 and label2 data. In the Split Data module, change the Splitting mode to "Regular Expression" and set the regular expression to \"Label" ^label1 (to get the label1 data, for example).
Then I used some "Partition and Sample" modules to reduce the size of the label1 and label2 data appropriately.
Finally, I used some "Add Rows" modules to join all of the data back together again.
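
If it helps to see the same split, sample, and recombine flow outside the designer, here is a minimal sketch in plain Python. It is only an analogy for the three modules, not what they run internally. The function name `balance_by_downsampling` and the toy rows are made up for this example, and unlike the experiment it samples every label down to exactly the size of the smallest group rather than using approximate rates.

```python
import random
from collections import Counter


def balance_by_downsampling(rows, label_of, seed=0):
    """Split rows by label, sample each group down to the size of the
    smallest group, then join the groups back together."""
    rng = random.Random(seed)

    # Step 1 (like Split Data): collect the rows for each label.
    groups = {}
    for row in rows:
        groups.setdefault(label_of(row), []).append(row)

    # Step 2 (like Partition and Sample): shrink every group to the
    # size of the rarest label, sampling without replacement.
    target = min(len(group) for group in groups.values())

    # Step 3 (like Add Rows): append the sampled groups to one dataset.
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))

    rng.shuffle(balanced)  # avoid leaving the rows grouped by label
    return balanced


# Toy data with roughly the distribution described in the question.
sizes = [("label1", 40), ("label2", 20), ("label3", 10), ("label4", 10), ("label5", 10)]
rows = [(i, i * 2, label) for label, n in sizes for i in range(n)]

balanced = balance_by_downsampling(rows, label_of=lambda r: r[2])
counts = Counter(r[2] for r in balanced)
print(counts)
```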

I didn't include this in my work, but you can also look at the SMOTE module. Instead of discarding rows from the common labels, it increases the number of rows for the low-occurring labels by generating synthetic examples (synthetic minority oversampling).
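
To give a feel for what synthetic minority oversampling does, here is a simplified sketch of the general idea: pick a minority point, pick one of its nearest minority neighbours, and create a new point somewhere on the line between them. This is an illustration under my own assumptions, not the implementation used by the Azure ML SMOTE module. The function name `smote_sketch` and its parameters are hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority
    points and one of their k nearest minority neighbours.
    Requires at least two minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of base among the other minority points
        # (Euclidean distance).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Random position along the segment from base to neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with two numeric features (value_x, value_y).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6)
for point in new_points:
    print(point)
```

Every synthetic point lies between two existing minority points, so the new rows stay inside the region the rare label already occupies.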