google-bigquery, bigtable, google-cloud-bigtable

Use case for BigQuery or Bigtable when querying aggregate values?


I have a use case that involves designing storage for 30 TB of text files as part of deploying a data pipeline on Google Cloud. My input data is in CSV format, and I want to minimize the cost of querying aggregate values for multiple users who will query the data in Cloud Storage with multiple engines. Which of the two options below would be better for this use case?

• Using Cloud Storage for storage and linking permanent tables in BigQuery for querying, or
• Using Cloud Bigtable for storage and installing the HBase shell on Compute Engine to query the Bigtable data.

Based on my analysis of this specific use case, I can see from the links below that Cloud Storage data can be queried through BigQuery, and that Bigtable supports CSV imports and querying. The BigQuery quotas documentation also mentions a maximum size per load job of 15 TB across all input files for CSV, JSON, and Avro, which I assume means I could run multiple load jobs if I am loading more than 15 TB (sketched after the links below).

https://cloud.google.com/bigquery/external-data-cloud-storage#temporary-tables  
https://cloud.google.com/community/tutorials/cbt-import-csv  
https://cloud.google.com/bigquery/quotas  
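
For example, I was thinking of something like the following, where each LOAD DATA statement should run as its own load job, so the 30 TB could be split into batches that each stay under the 15 TB per-job limit (the dataset, table, and bucket names below are just placeholders):

    -- First batch of CSVs; the table is created on the first load (schema
    -- auto-detected here, though it could also be declared explicitly).
    LOAD DATA INTO my_dataset.text_data
    FROM FILES (
      format = 'CSV',
      uris = ['gs://my-bucket/csv/batch1/*.csv'],
      skip_leading_rows = 1
    );

    -- Second batch appends to the same table in a separate load job.
    LOAD DATA INTO my_dataset.text_data
    FROM FILES (
      format = 'CSV',
      uris = ['gs://my-bucket/csv/batch2/*.csv'],
      skip_leading_rows = 1
    );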

So, does that mean I can use BigQuery for the above use case?


Solution

  • The short answer is yes.

    I wrote about this in:

    And when loading, cluster your tables for massive cost improvements on the most common queries:

    In summary:

    • BigQuery can read CSVs and other files straight from GCS (see the first sketch after this list).
    • You can define a view that parses those CSVs in any way you prefer, all within SQL.
    • You can run a CREATE TABLE statement to materialize the CSVs into BigQuery native tables, for better performance and lower costs.
    • Instead of CREATE TABLE, you can do imports via the API; those are free, unlike CREATE TABLE, which incurs the cost of a query (see the second sketch below).
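
    For instance, here is a minimal sketch of the first two points, assuming hypothetical dataset, table, bucket, and column names (the CSVs are assumed to have three columns: user_id, event_ts, amount):

        -- External table: BigQuery queries the CSVs in place on GCS, no load required.
        CREATE EXTERNAL TABLE my_dataset.events_ext (
          user_id  STRING,
          event_ts STRING,
          amount   STRING
        )
        OPTIONS (
          format = 'CSV',
          uris = ['gs://my-bucket/events/*.csv'],
          skip_leading_rows = 1
        );

        -- View: parse and cast the raw CSV columns, entirely in SQL.
        CREATE OR REPLACE VIEW my_dataset.events_clean AS
        SELECT
          SAFE_CAST(user_id AS INT64)                          AS user_id,
          SAFE.PARSE_TIMESTAMP('%Y-%m-%d %H:%M:%S', event_ts)  AS event_ts,
          SAFE_CAST(amount AS NUMERIC)                         AS amount
        FROM my_dataset.events_ext;

    Queries against the view scan the CSVs in Cloud Storage on every run, which is the simplest setup but not the cheapest one for repeated aggregates.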

    15 TB can be handled easily by BigQuery.
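
    And a sketch of the last two bullets (placeholder names again): CREATE TABLE ... AS SELECT materializes the parsed view into a clustered native table at the cost of one query over the CSVs, while a LOAD DATA statement is the SQL counterpart of an API import / load job and, assuming the usual load-job pricing, is not billed as a query:

        -- Materialize the parsed view into a native table, partitioned and clustered
        -- so the most common aggregate queries scan far less data.
        CREATE OR REPLACE TABLE my_dataset.events
        PARTITION BY DATE(event_ts)
        CLUSTER BY user_id
        AS
        SELECT * FROM my_dataset.events_clean;

        -- Alternative: load the CSVs directly as a load job (no query charge),
        -- skipping the external-table step entirely.
        LOAD DATA INTO my_dataset.events_loaded (user_id INT64, event_ts TIMESTAMP, amount NUMERIC)
        FROM FILES (
          format = 'CSV',
          uris = ['gs://my-bucket/events/*.csv'],
          skip_leading_rows = 1
        );

        -- A typical aggregate over the clustered table.
        SELECT user_id, SUM(amount) AS total_amount
        FROM my_dataset.events
        GROUP BY user_id;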