H2O Flow UI: How does Split Frame work for a multiclass dataset?


I just set up the H2O Flow UI. I have a CSV file with the following label counts:

Label | Count
0     | 9340
1     | 400
2     | 349

I imported the file and parsed it. After running Split Frame with an 80:20 ratio, I downloaded the two resulting CSV files to check the label counts.

But the resulting split does not match what I expected.

I was expecting the data to be split as follows:

Class | Expected 0.8 | Actual 0.8 | Expected 0.2 | Actual 0.2
0     | 7472         | 7418       | 1868         | 1882
1     | 320          | 610        | 80           | 159
2     | 279          | 15         | 69           | 5

How can I split my data into the expected counts above so that I can use the two frames as the training and validation frames for model building?


Solution

  • H2O-3's split frame option is not designed to provide exact splits.

    H2O-3 is designed to be efficient on big data, so it uses a probabilistic splitting method rather than an exact split. For example, when specifying a split of 0.75/0.25, H2O-3 will produce a train/test split with an expected value of 0.75/0.25 rather than exactly 0.75/0.25. On small datasets, the sizes of the resulting splits will deviate from the expected value more than on big data, where they will be very close to exact.
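
To make the difference concrete, here is a minimal pure-Python sketch (no H2O required) that contrasts a probabilistic split with an exact per-class split on the label counts from the question. It assumes the probabilistic method behaves like an independent random draw per row; that is an illustration of the idea, not H2O-3's actual implementation. The function names `probabilistic_split` and `stratified_exact_split` are hypothetical helpers written for this example.

```python
import random
from collections import Counter, defaultdict


def probabilistic_split(labels, ratio, rng):
    """Send each row to 'train' with probability `ratio`, independently.

    Assumption: this only mimics the idea of a probabilistic split;
    it is not H2O-3's real code.
    """
    train, test = [], []
    for label in labels:
        (train if rng.random() < ratio else test).append(label)
    return train, test


def stratified_exact_split(labels, ratio, rng):
    """Shuffle each class separately and cut it at round(ratio * class size)."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = round(ratio * len(idxs))
        train_idx.extend(idxs[:cut])
        test_idx.extend(idxs[cut:])
    return train_idx, test_idx


# Label counts taken from the question.
labels = [0] * 9340 + [1] * 400 + [2] * 349
rng = random.Random(42)

# Probabilistic: per-class counts vary from run to run around the 80/20 target.
train, test = probabilistic_split(labels, 0.8, rng)
print("probabilistic:", Counter(train), Counter(test))

# Stratified and exact: per-class counts are fixed by the rounding rule.
train_idx, test_idx = stratified_exact_split(labels, 0.8, rng)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print("stratified:", train_counts, test_counts)
```

With rounding, the stratified version gives 7472/320/279 rows for training and 1868/80/70 for validation (class 2 gets 70 rather than 69 because 0.8 × 349 = 279.2 is rounded, not truncated on both sides). One way to apply this to the question would be to do the split outside Flow and import the two CSV files separately. If you work from the H2O Python client instead of Flow, `H2OFrame.stratified_split` on the label column may be worth a look; check the documentation for your version. Also note that the "Actual" columns in the question sum to 769 rows for class 1 and 20 for class 2, not 400 and 349, so it may be worth checking how the label column was parsed as well.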