Search code examples
sqlrandomgoogle-bigqueryansi-sql

In BigQuery, how to random split query results?


In BigQuery (standard SQL), I would like to randomly split result lines from a query.

  • The split needs to be consistent and should not change over time: meaning that every time the query is ran, the split should remain the same.

  • In particular, if data is added to the source table, data previously in one side of the split should remain in the same split.

  • The strategy should be able to handle different splitting ratios

I currently have, for instance, for a table mytable, with columns (order_id, created_at, country):

for a 10%/90% split, on the one hand

SELECT 
   *
FROM
   `mytable`
WHERE RAND() <= 10/90

on the other hand

SELECT 
   *
FROM
   `mytable`
WHERE RAND() > 10/90

But this yields an inconsistent split. Is there a way to achieve this properly?


Solution

  • The solution is to use a hashing function on one column that discriminates uniquely each line of your source table (for instance here orderId).

    BigQuery has a hashing function which target type is a signed INT64 (and source STRING or BYTES): FARM_FINGERPRINT (from here).

    A solution is to encode each line according to its FARM_FINGERPRINT(orderId) which is a sample from a uniform distribution of all INT64 numbers.

    For a given k, MOD(ABS(FARM_FINGERPRINT(orderId)),k) constitutes a uniform distribution of integers in [0,k-1] (Be wary that MOD(a,b) with a negative and b positive may return a negative number).

    Thus, assume you want a 10%/90% split.

    The queries for each of the splits would look like:

    SELECT 
       *
    FROM
       `mytable`
    WHERE MOD(ABS(FARM_FINGERPRINT(orderId)),10) = 0
    

    and

    SELECT 
       *
    FROM
       `mytable`
    WHERE MOD(ABS(FARM_FINGERPRINT(orderId)),10) != 0
    

    It can be generalised to any 1/k split:

    SELECT 
       *
    FROM
       `mytable`
    WHERE MOD(ABS(FARM_FINGERPRINT(orderId)),@k) = 0
    

    and

    SELECT 
       *
    FROM
       `mytable`
    WHERE MOD(ABS(FARM_FINGERPRINT(orderId)),@k) != 0
    

    Changing the value of the discriminative mod integer (0 in the above) allows to have up to k different splits (which can be very useful if you are doing multiple-fold cross validation in ML for instance).

    Last but not least:

    • the folds are robust to data addition in the table
    • the folds remain unchanged, whenever the queries are ran