
Does Google's AutoML Tables shuffle my data samples before training/evaluation?


I searched through the documentation but still can't tell whether the service shuffles the data before training/evaluation. I need to know this because my data is a time series, so a realistic evaluation would train the model on samples from an earlier period and evaluate it on samples from a later one.

Can someone please tell me the answer or point me to a way to figure this out? I know that I can export the evaluation results and work from those, but BigQuery does not seem to preserve the order of the original data, and there is no absolute time feature in the data.


Solution

  • It doesn't shuffle the data as such, but it does split it, and by default that split is random.

    Take a look at the documentation page "About controlling data split". It says:

    By default, AutoML Tables randomly selects 80% of your data rows for training, 10% for validation, and 10% for testing.

    If your data is time-sensitive, you should use the Time column.

    By using it, AutoML Tables will use the earliest 80% of the rows for training, the next 10% of rows for validation, and the latest 10% of rows for testing.
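
To make the difference between the two behaviours quoted above concrete, here is a minimal local sketch in plain Python. It is not AutoML Tables' implementation and does not call its API; the function names, the `t` field, and the integer rounding of the 80/10/10 boundaries are all assumptions made for illustration.

```python
import random


def split_sizes(n, train_pct=80, valid_pct=10):
    """Return (n_train, n_valid); the remaining rows form the test set.

    Integer arithmetic is an illustrative choice; how the service rounds
    the boundaries is not stated in the quoted documentation.
    """
    return n * train_pct // 100, n * valid_pct // 100


def random_split(rows, seed=0):
    """Mimic the default behaviour: rows are assigned to the sets at random."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train, n_valid = split_sizes(len(shuffled))
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )


def chronological_split(rows, time_key):
    """Mimic the Time column behaviour: earliest 80% / next 10% / latest 10%."""
    ordered = sorted(rows, key=lambda row: row[time_key])
    n_train, n_valid = split_sizes(len(ordered))
    return (
        ordered[:n_train],
        ordered[n_train:n_train + n_valid],
        ordered[n_train + n_valid:],
    )


if __name__ == "__main__":
    # Hypothetical rows whose storage order does not match their time order.
    rows = [{"t": t} for t in [5, 3, 9, 0, 7, 1, 8, 2, 6, 4]]
    train, valid, test = chronological_split(rows, "t")
    print("train:", [row["t"] for row in train])
    print("valid:", [row["t"] for row in valid])
    print("test: ", [row["t"] for row in test])
```

With the chronological split, every test row is later than every training row regardless of how the rows were stored, which is the property a time-series evaluation needs. Note that this relies on the data having a column that orders the rows in time.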