Search code examples
google-bigquerygoogle-cloud-stackdrivergoogle-cloud-load-balancer

How to query Google Cloud Load Balancer requests sinked by stackdrive via BigQuery by timestamp partition without load all data?


I'm trying to query my dataset by timestamp but i'm billed for full dataset scan. Is it possible to query only for specific timestamp and reduce the size of data analyzed?

Note that i'm using Stackdrive Sink that stream data real time on my dataset. I don't have any control on the schema.

SELECT
  REGEXP_EXTRACT(httpRequest.requestUrl, r"^.+cid=([0-9]*)") as cid,
  REGEXP_EXTRACT(httpRequest.requestUrl, r"^.+pid=([0-9]*)") as pid
FROM
   `dataset_20190305` -- Date partition - no time
WHERE timestamp >=TIMESTAMP('2019-03-05 16:50:00')

Solution

  • There are only two ways to reduce BQ query cost using only the where clause: partitioning and clustering. Judging by your dataset name, I suspect that your dataset only contains data relevant to March 5th. Unfortunately, BQ only supports day partitioning, so it won't help your case. There is no way to partition by any time block smaller than day. That being said, it may make your data cleaner if you simply create one table and partition it by day instead of creating a new dataset each day.

    The other option, clustering, involves specifying a field and providing all potential values. For example, if you had a field "Color" of type String, you could set potential values as "Red", "Green", and "Blue", and therefore, reduce query size to a third.

    If you're just looking to reduce cost and find yourself repeating a similar query on a specific table a lot, you can query it once and copy the results to a new table, which would essentially limit your query to a specified time range.

    That being said, in general, try to follow BQ's recommendations, such as only selecting columns that you're interested in.