Currently I am using the following query:
SELECT
  ID,
  Key
FROM
  mydataset.mytable
WHERE ID = 100077113
  AND Key = '06019'
ID is unique.
If I know the Key, looking up an ID could be restricted to ~10,000 rows, which would run much faster and process much less data.
How can I use the new clustering capabilities in BigQuery to partition on the field Key?
(I'm going to summarize and expand on what Mikhail, Pentium10, and Pavan said)
I have a table with 12M rows and 76 GB of data. This table has no timestamp column.
This is how to cluster said table, while creating a fake date column for fake partitioning (clustering currently requires a partitioned table, hence the fake date):
CREATE TABLE `fh-bigquery.public_dump.github_java_clustered`
(id STRING, size INT64, content STRING, binary BOOL
, copies INT64, sample_repo_name STRING, sample_path STRING
, fake_date DATE)
PARTITION BY fake_date
CLUSTER BY id AS (
  SELECT *, DATE('1980-01-01') fake_date
  FROM `fh-bigquery.github_extracts.contents_java`
)
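To answer the question directly: the same pattern applied to the question's table could look like the sketch below. The table and column names come from the question; the destination table name and letting the CTAS infer the schema are assumptions.

# hypothetical sketch: cluster the question's table on Key,
# using the same fake-date trick for partitioning
# (`mytable_clustered` is an assumed destination name)
CREATE TABLE `mydataset.mytable_clustered`
PARTITION BY fake_date
CLUSTER BY Key AS (
  SELECT *, DATE('1980-01-01') fake_date
  FROM `mydataset.mytable`
)

After that, the query from the top of the question should only read the blocks that contain the filtered Key.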
Did it work?
# original table
SELECT *
FROM `fh-bigquery.github_extracts.contents_java`
WHERE id='be26cfc2bd3e21821e4a27ec7796316e8d7fb0f3'
(3.3s elapsed, 72.1 GB processed)
# clustered table
SELECT *
FROM `fh-bigquery.public_dump.github_java_clustered`
WHERE id='be26cfc2bd3e21821e4a27ec7796316e8d7fb0f3'
(2.4s elapsed, 232 MB processed)
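Bytes processed dropped from 72.1 GB to 232 MB, roughly 300x less. As a side note, if you want to confirm which columns a table is clustered by, INFORMATION_SCHEMA exposes it; a minimal sketch against the dataset used above:

# clustering_ordinal_position is non-NULL only for clustering columns
SELECT column_name, clustering_ordinal_position
FROM `fh-bigquery.public_dump.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'github_java_clustered'
AND clustering_ordinal_position IS NOT NULL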
Read more about what I learned here: https://medium.com/@hoffa/bigquery-optimized-cluster-your-tables-65e2f684594b