In this blog a Google Cloud employee explains how to do repeatable sampling of data sets for machine learning in BigQuery. This is very important for creating (and replicating) train/validation/test partitions of your data.
However the blog uses Legacy SQL, which Google has now deprecated in favor of Standard SQL.
How would you re-write the blog's sampling code shown below, but using Standard SQL?
#legacySQL
SELECT
date,
airline,
departure_airport,
departure_schedule,
arrival_airport,
arrival_delay
FROM
[bigquery-samples:airline_ontime_data.flights]
WHERE
ABS(HASH(date)) % 10 < 8
Standard SQL would re-write the query thus:
#standardSQL
SELECT
date,
airline,
departure_airport,
departure_schedule,
arrival_airport,
arrival_delay
FROM
`bigquery-samples.airline_ontime_data.flights`
WHERE
ABS(MOD(FARM_FINGERPRINT(date), 10)) < 8
Specifically here are the changes:
MOD
function (not %
).FARM_FINGERPRINT
(not HASH
). This is actually a different hashing function than Legacy SQL's HASH
, which wasn't in fact consistent over time as the blog had implied.