I have a training set of 1.6 M records and my target has 493 unique values (categorical data).
I am getting an error saying
Missing label(s) in train split: target column contains 493 distinct values, but only 485 present.
Missing label(s) in test split: target column contains 493 distinct values, but only 403 present.
Missing label(s) in eval split: target column contains 493 distinct values, but only 403 present.
There must be at least one instance of each label value in every split.
What does "present" mean here?
Are there empty values in my data set, or what kind of errors should I look for in my training data?
The root cause is that you do not have enough examples of one or more of the distinct labels, so some label values do not appear in one or more of the splits. By default, AutoML splits the data into Train, Eval and Test as 80%, 10%, 10%. This may depend on which AutoML product you are using, but I think it is roughly the same for all of them (see AutoML Tables or AutoML Vision).
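Every label must exist in all three splits. Even if every label is there in the original data, that does not guarantee that each label ends up in every split. "Present" refers to the labels that actually landed in a given split: your train split, for example, contains only 485 of the 493 labels, so 8 labels are missing from it. Even with the amount of data you have, this can happen if some labels are uncommon (low ratio).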
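To see which labels are at risk, you can count how often each label occurs and simulate a random 80/10/10 split locally. This is only a rough sketch: the helper names `rare_labels` and `missing_per_split` are made up for illustration, and the random shuffle is an assumption that does not reproduce how AutoML actually assigns rows. A label needs at least 3 rows to have any chance of appearing in all three splits, which is why `min_count` defaults to 3.

```python
import random
from collections import Counter

def rare_labels(labels, min_count=3):
    """Return {label: count} for labels seen fewer than min_count times."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n < min_count}

def missing_per_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Randomly split the labels and report which label values each split lacks.

    Assumption: a plain random shuffle stands in for AutoML's own split logic.
    """
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    train_end = int(n * fractions[0])
    eval_end = train_end + int(n * fractions[1])
    splits = {
        "train": shuffled[:train_end],
        "eval": shuffled[train_end:eval_end],
        "test": shuffled[eval_end:],
    }
    all_labels = set(shuffled)
    return {name: all_labels - set(part) for name, part in splits.items()}

# Toy data: one frequent label and one label that occurs only once.
labels = ["common"] * 1000 + ["rare"] * 1
print(rare_labels(labels))
print(missing_per_split(labels))
```

With your real data, pass the target column as `labels`. Any label reported by `rare_labels`, or one that keeps showing up as missing across several seeds, is a likely cause of the error.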
Solutions for this are:
Alternatively, this can happen because some data is not properly recognized by AutoML, which makes those records invalid and causes them to be discarded, ultimately leading to the root cause described above. If that is the case (which you can check by trying the solutions above and finding that they do not solve the problem), I recommend contacting GCP support, as it is probably an issue in AutoML itself.
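Before contacting support, a quick local check for obviously unusable target values may help. The sketch below only counts rows whose target is missing, `None`, or blank. The function name `count_unusable_targets` and the column name `target` are assumptions for illustration, and what AutoML itself treats as invalid may well differ.

```python
def count_unusable_targets(rows, target="target"):
    """Count rows whose target is missing, None, or blank after stripping."""
    unusable = 0
    for row in rows:
        value = row.get(target)  # None when the key is absent
        if value is None or str(value).strip() == "":
            unusable += 1
    return unusable

# Toy rows standing in for records read from the training file.
rows = [{"target": "a"}, {"target": ""}, {"target": "  "}, {"target": None}, {}]
print(count_unusable_targets(rows))
```

Each row here is a dict, which is the shape `csv.DictReader` produces, so the same function can be run over a CSV export of the training data.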