sql hash google-bigquery cross-validation

How to do repeatable sampling in BigQuery Standard SQL?

In this blog a Google Cloud employee explains how to do repeatable sampling of data sets for machine learning in BigQuery. This is very important for creating (and replicating) train/validation/test partitions of your data.

However the blog uses Legacy SQL, which Google has now deprecated in favor of Standard SQL.

How would you re-write the blog's sampling code shown below, but using Standard SQL?

#legacySQL
SELECT
  date,
  airline,
  departure_airport,
  departure_schedule,
  arrival_airport,
  arrival_delay
FROM
  [bigquery-samples:airline_ontime_data.flights]
WHERE
  ABS(HASH(date)) % 10 < 8

Solution

Standard SQL would re-write the query thus:

#standardSQL
SELECT
  date,
  airline,
  departure_airport,
  departure_schedule,
  arrival_airport,
  arrival_delay
FROM
  `bigquery-samples.airline_ontime_data.flights`
WHERE
  ABS(MOD(FARM_FINGERPRINT(date), 10)) < 8

Specifically here are the changes:

a period (not colon) to separate the Google Cloud project from table name.
backticks (not square brackets) to escape hyphen in the table name.
MOD function (not %).
FARM_FINGERPRINT (not HASH). This is actually a different hashing function than Legacy SQL's HASH, which wasn't in fact consistent over time as the blog had implied.