Search code examples
google-bigquerypartitionclustered-index

My data can’t be date partitioned, how do I use clustering?


Currently I using following query: SELECT ID, Key FROM mydataset.mytable where ID = 100077113 
and Key='06019'

My data has 100 million rows:

ID - unique

Key - can have ~10,000 keys

If I know the key looking for ID can be done on ~10,000 rows and work much faster and process much less data.

How can I use the new clustering capabilites in BigQuery to partition on the field Key?


Solution

  • (I'm going to summarize and expand on what Mikhail, Pentium10, and Pavan said)

    I have a table with 12M rows and 76 GB of data. This table has no timestamp column.

    This is how to cluster said table - while creating a fake date column for fake partitioning:

    CREATE TABLE `fh-bigquery.public_dump.github_java_clustered` 
      (id STRING, size INT64, content STRING, binary BOOL
       , copies INT64, sample_repo_name STRING, sample_path STRING
       , fake_date DATE) 
    PARTITION BY fake_date 
    CLUSTER BY id AS (
      SELECT *, DATE('1980-01-01') fake_date 
      FROM `fh-bigquery.github_extracts.contents_java`
    )
    

    Did it work?

    # original table
    
    SELECT *
    FROM `fh-bigquery.github_extracts.contents_java`
    WHERE id='be26cfc2bd3e21821e4a27ec7796316e8d7fb0f3'
    
    (3.3s elapsed, 72.1 GB processed)
    
    
    # clustered table
    
    SELECT *
    FROM `fh-bigquery.public_dump.github_java_clustered2`
    WHERE id='be26cfc2bd3e21821e4a27ec7796316e8d7fb0f3'
    (2.4s elapsed, 232 MB processed)
    

    What I learned here:

    • Clustering can work with unique ids, even for tables without a date to partition by.
    • Prefer using a fake date instead of a null date (but only for now - this should be improved).
    • Clustering made my query 99.6% cheaper when looking for rows by id!

    Read more: https://medium.com/@hoffa/bigquery-optimized-cluster-your-tables-65e2f684594b