I just set up the H2O Flow UI. I have a CSV file with the following label counts.
Label | Count
0 | 9340
1 | 400
2 | 349
I imported the file and parsed it. After splitting the frame with an 80:20 ratio, I downloaded the two resulting CSV files to check the label counts.
But the split did not come out the way I expected.
I was expecting the data to be split as follows:
Class | Expected 0.8 | Actual 0.8 | Expected 0.2 | Actual 0.2
0 | 7472 | 7418 | 1868 | 1882
1 | 320 | 610 | 80 | 159
2 | 279 | 15 | 69 | 5
How can I split my data to get the expected counts above, so that I can use the two parts as training and validation frames for model building?
H2O-3's split frame option is not designed to produce exact splits.
H2O-3 is designed to be efficient on big data, so it uses a probabilistic splitting method rather than an exact one. For example, when you specify a split of 0.75/0.25, H2O-3 produces a train/test split whose expected proportions are 0.75/0.25, not one that is exactly 0.75/0.25. On small datasets, the sizes of the resulting splits deviate from the expected values more than on big data, where they are very close to exact.
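If you need exact per-class counts, one possible workaround is to do the split outside H2O and import the two resulting files separately. The sketch below is plain Python, not H2O's API: `probabilistic_split` only illustrates the idea of assigning each row independently (it is an assumption about the general approach, not H2O-3's actual implementation), and `exact_stratified_split` is a hypothetical helper that cuts each class at a fixed fraction. The label list is built from the counts in the question.

```python
import random
from collections import Counter, defaultdict


def probabilistic_split(labels, train_frac=0.8, seed=0):
    """Send each row to train independently with probability train_frac.

    Split sizes are only correct in expectation, which is why they
    drift on small datasets or small classes.
    """
    rng = random.Random(seed)
    train, valid = [], []
    for i in range(len(labels)):
        (train if rng.random() < train_frac else valid).append(i)
    return train, valid


def exact_stratified_split(labels, train_frac=0.8, seed=0):
    """Shuffle the rows of each class, then cut each class at
    floor(n * train_frac); the remainder goes to validation."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, valid = [], []
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)
        cut = int(len(idx) * train_frac)
        train.extend(idx[:cut])
        valid.extend(idx[cut:])
    return train, valid


# Label column rebuilt from the counts in the question (0: 9340, 1: 400, 2: 349).
labels = [0] * 9340 + [1] * 400 + [2] * 349

train_idx, valid_idx = exact_stratified_split(labels)
train_counts = Counter(labels[i] for i in train_idx)
valid_counts = Counter(labels[i] for i in valid_idx)
print("train:", dict(train_counts))
print("valid:", dict(valid_counts))
```

The returned values are row indices, so you can use them to write two CSV files from your original data and import those into Flow as the training and validation frames. Running `probabilistic_split` on the same labels with different seeds shows how the per-class counts move around when the split is only correct in expectation.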