Search code examples
sqlhashgoogle-bigquerycross-validation

How to do repeatable sampling in BigQuery Standard SQL?


In this blog a Google Cloud employee explains how to do repeatable sampling of data sets for machine learning in BigQuery. This is very important for creating (and replicating) train/validation/test partitions of your data.

However the blog uses Legacy SQL, which Google has now deprecated in favor of Standard SQL.

How would you re-write the blog's sampling code shown below, but using Standard SQL?

#legacySQL
SELECT
  date,
  airline,
  departure_airport,
  departure_schedule,
  arrival_airport,
  arrival_delay
FROM
  [bigquery-samples:airline_ontime_data.flights]
WHERE
  ABS(HASH(date)) % 10 < 8

Solution

  • Standard SQL would re-write the query thus:

    #standardSQL
    SELECT
      date,
      airline,
      departure_airport,
      departure_schedule,
      arrival_airport,
      arrival_delay
    FROM
      `bigquery-samples.airline_ontime_data.flights`
    WHERE
      ABS(MOD(FARM_FINGERPRINT(date), 10)) < 8
    

    Specifically here are the changes:

    • a period (not colon) to separate the Google Cloud project from table name.
    • backticks (not square brackets) to escape hyphen in the table name.
    • MOD function (not %).
    • FARM_FINGERPRINT (not HASH). This is actually a different hashing function than Legacy SQL's HASH, which wasn't in fact consistent over time as the blog had implied.