machine-learning google-cloud-automl

What causes Google Auto ML to error out with "Missing labels in training/test/eval split"?


I have a training set of 1.6 M records and my target has 493 unique values (categorical data).

I am getting an error saying:

Missing label(s) in train split: target column contains 493 distinct values, but only 485 present.
Missing label(s) in test split: target column contains 493 distinct values, but only 403 present. 
Missing label(s) in eval split: target column contains 493 distinct values, but only 403 present. 
There must be at least one instance of each label value in every split.

What do they mean by "present"? Are there empty values in my data set, or what kind of errors should I look for in my training data?


Solution

  • The root cause of this issue is that you do not have enough rows for one or several of the distinct labels, so some label values do not appear in one or more of the splits. By default, AutoML splits the data into train, eval and test sets in an 80%/10%/10% ratio. This may depend on which AutoML product you are using, but I think it is roughly the same for all of them (see AutoML Tables or AutoML Vision).

    Every label must exist in all three splits. Having every label in the original data therefore does not guarantee that every label ends up in every split. Even with the amount of data you have, this can happen if some labels are rare (a low share of the rows).
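To get a feel for how rare is "too rare", here is a rough back-of-the-envelope sketch. It assumes each row is assigned to a split independently at random, which is a simplification and not necessarily how AutoML actually splits; `prob_label_missing` is a hypothetical helper name.

```python
def prob_label_missing(n_rows: int, split_fraction: float) -> float:
    """Chance that none of a label's n_rows land in a given split,
    assuming every row is assigned independently at random."""
    return (1.0 - split_fraction) ** n_rows


# Probability that a label is absent from a 10% split (test or eval),
# for labels with 1, 5, 10 and 50 rows.
for n in (1, 5, 10, 50):
    print(n, round(prob_label_missing(n, 0.10), 3))
# 1 0.9
# 5 0.59
# 10 0.349
# 50 0.005
```

Under that assumption, a label with only 10 rows is missing from a given 10% split about a third of the time, so with 493 labels it only takes a handful of rare ones to trigger the error.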

    Possible solutions are:

    • Add more data containing the troublesome labels. Try to balance the amount of data you have for each label so that all of them appear in each split.
    • Use the data split column, the time column or any of the other known methods to specify each of the splits manually.
    • Remove the labels that occur too rarely. This can be a quick fix if you are unable to apply either of the options above.
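The second and third options can be combined in one preprocessing pass. The following is only a sketch using the Python standard library: `assign_splits` is a hypothetical helper, and the split names TRAIN, VALIDATE and TEST are what I believe the AutoML Tables data split column accepts, so check them against the current documentation for your product before relying on them.

```python
import random
from collections import defaultdict


def assign_splits(labels, seed=0, min_rows=3):
    """Return one split name per row, or None for rows whose label is too rare.

    Each kept label gets at least one row in TEST, VALIDATE and TRAIN,
    approximating an 80/10/10 split within every label.
    """
    rng = random.Random(seed)
    rows_by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        rows_by_label[label].append(idx)

    splits = [None] * len(labels)
    for idxs in rows_by_label.values():
        if len(idxs) < min_rows:
            continue  # cannot cover three splits; leave as None (drop these rows)
        rng.shuffle(idxs)
        n_holdout = max(1, round(0.10 * len(idxs)))  # rows per held-out split
        for i in idxs[:n_holdout]:
            splits[i] = "TEST"
        for i in idxs[n_holdout:2 * n_holdout]:
            splits[i] = "VALIDATE"
        for i in idxs[2 * n_holdout:]:
            splits[i] = "TRAIN"
    return splits


# Toy data: "a" is common, "b" has just enough rows, "c" is too rare.
labels = ["a"] * 10 + ["b"] * 3 + ["c"] * 2
splits = assign_splits(labels)
kept = [(lab, s) for lab, s in zip(labels, splits) if s is not None]
print(len(kept))  # 13: the two "c" rows are dropped
```

Rows marked `None` are the ones to remove before upload; the remaining values would be written out as an extra column and selected as the data split column when creating the dataset.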

    Alternatively, AutoML may be failing to recognize some of your data, treating those records as invalid and discarding them, which ultimately leads to the same root cause. If the solutions above do not fix the problem, that is the likely explanation, and I recommend contacting GCP support, as it is probably an issue in AutoML itself.
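Before opening a support case, it may be worth auditing the target column yourself. AutoML's actual validation rules are not described here, so the checks below are only guesses at what could cause rows to be dropped or labels to be counted differently than expected: empty target cells, and labels that differ only by surrounding whitespace or letter case. `audit_target` is a hypothetical helper.

```python
from collections import Counter


def audit_target(values):
    """Summarise a target column: empty cells, distinct labels, and
    labels that differ only by whitespace or case."""
    def is_empty(v):
        return v is None or str(v).strip() == ""

    empty = sum(1 for v in values if is_empty(v))
    raw = Counter(str(v) for v in values if not is_empty(v))

    # Group raw labels by a normalised form to spot near-duplicates.
    groups = {}
    for label in raw:
        groups.setdefault(label.strip().lower(), []).append(label)
    near_duplicates = {k: sorted(v) for k, v in groups.items() if len(v) > 1}

    return {"empty": empty, "distinct": len(raw), "near_duplicates": near_duplicates}


report = audit_target(["cat", "Cat ", "dog", "", None, "dog"])
print(report)
# {'empty': 2, 'distinct': 3, 'near_duplicates': {'cat': ['Cat ', 'cat']}}
```

If the distinct count here does not match the 493 that AutoML reports, or if many rows have an empty target, that points to a data problem rather than a platform one.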