google-bigquery, bigdata, kaggle

BigQuery maximum processing data size allowance?


My question is: how much data are we allowed to process on BigQuery? I am using Stack Overflow's Kaggle dataset, and the text I am analyzing is around 27 GB. I just want to get the average length per entry, so I run:

query_length_text = """
    SELECT 
        AVG(CHAR_LENGTH(title)) AS avg_title_length,
        AVG(CHAR_LENGTH(body)) AS avg_body_length
    FROM
        `bigquery-public-data.stackoverflow.stackoverflow_posts`
"""

However, this fails with:

Query cancelled; estimated size of 26.847077486105263 exceeds limit of 1 GB

I am only returning a couple of floats, so I know the result size isn't the problem. Does the 1 GB limit apply to the amount of data processed as well? If so, how do I process it in batches, 1 GB at a time?


Solution

  • Kaggle sets a 1 GB limit on requests by default (to keep your monthly quota of 5 TB from running out). That is what causes this error. To get around it, you can override the limit with the max_gb_scanned parameter, like this:

    df = bq_assistant.query_to_pandas_safe(QUERY, max_gb_scanned=N)
    

    where N is the amount of data (in GB) scanned by your query, or any number higher than that (see the fuller sketch below).
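
    For context, here is a minimal sketch of the full workflow with Kaggle's bq_helper library: build the helper for the public Stack Overflow dataset, estimate how much data the query will scan, then pass a limit above that estimate to query_to_pandas_safe. The 30 GB ceiling below is an assumption based on the ~27 GB scan reported in the question, not a required value.

    from bq_helper import BigQueryHelper

    # Helper object pointing at the public Stack Overflow dataset
    bq_assistant = BigQueryHelper("bigquery-public-data", "stackoverflow")

    query_length_text = """
        SELECT
            AVG(CHAR_LENGTH(title)) AS avg_title_length,
            AVG(CHAR_LENGTH(body)) AS avg_body_length
        FROM
            `bigquery-public-data.stackoverflow.stackoverflow_posts`
    """

    # Check how many GB the query will scan before running it
    estimated_gb = bq_assistant.estimate_query_size(query_length_text)
    print(f"Estimated scan: {estimated_gb:.2f} GB")

    # Raise the safety limit above the estimate (~27 GB here, so 30 GB
    # is an assumed safe ceiling). Note the full scan still counts
    # against the 5 TB monthly quota.
    df = bq_assistant.query_to_pandas_safe(query_length_text,
                                           max_gb_scanned=30)
    print(df)

    Calling estimate_query_size first is a cheap dry run, so it is a good habit before overriding the limit on a large table.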