I have usecase for designing storage for 30 TB of text files as part of deploying data pipeline on Google cloud. My input data is in CSV format, and I want to minimize the cost of querying aggregate values for multiple users who will query the data in Cloud Storage with multiple engines. Which would be a better option in below for this use case?
Using Cloud Storage for storage and link permanent tables in Big Query for query or Using Cloud Big table for storage and installing HBaseShell on compute engine to query Big table data.
Based on my analysis in below for this specific usecase, I see below where cloudstorage can be queried in through BigQuery. Also, Bigtable supports CSV imports and querying. BigQuery limits also mention a maximum size per load job of 15 TB across all input files for CSV, JSON, and Avro based on the documentation, which means i could load mutiple load jobs if loading more than 15 TB, i assume.
https://cloud.google.com/bigquery/external-data-cloud-storage#temporary-tables
https://cloud.google.com/community/tutorials/cbt-import-csv
https://cloud.google.com/bigquery/quotas
So, does that mean I can use BigQuery for the above usecase?
The short answer is yes.
I wrote about this in:
And when loading cluster your tables, for massive improvements in costs for the most common queries:
In summary:
CREATE TABLE
statement to materialize the CSVs into BigQuery native tables for better performance and costs.CREATE TABLE
you can do imports via API, those are free (instead of cost of query for CREATE TABLE
.15 TB can be handled easily by BigQuery.