
Chunking BigQuery results


Hi, I'm using the BigQuery Python library and I'm having some trouble understanding how the RowIterator class works. Given a query job_id, I would like to fetch the result rows in chunks of a fixed size.

I first run the query using the client:

query_job = client.query(query, job_config=job_config)

Later, I fetch the job like this:

job = client.get_job(query_job.job_id)
result_iter = job.result()

I would like to know how job.result() gets the results. Does it fetch results one row at a time from the BigQuery result table, or are the entire results stored in memory? Also, how or where do I specify that I would like to fetch results in a fixed chunk size and get the next page token?

Thanks!


Solution

  • Looking at the RowIterator code, it will attempt to read all results using one GET request and then iterate through them in memory.

    There is no way to set max_results in job.result(), so it will be limited by the BigQuery default limits of 100,000 rows [1] and 10MB total size per response. If the results don't fit within these limits, the iterator will try to read the next page.

    The list_rows API [2] allows you to set max_results and a page_token. If your goal is to read all job results in pages of a defined size, you can use this API to create a separate RowIterator for each page; you can get the next page token from the iterator [3]. A sketch of this pattern follows the reference links below. Hope this helps.

    [1] https://cloud.google.com/bigquery/docs/paging-results

    [2] https://github.com/GoogleCloudPlatform/google-cloud-python/blob/16b0f8eb9adacf47127521f93c13380a332268b1/bigquery/google/cloud/bigquery/client.py#L1218

    [3] https://github.com/GoogleCloudPlatform/google-cloud-python/blob/b0dc7bc17d8bb3ab56928f28096846c55cb79893/api_core/google/api_core/page_iterator.py#L340
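For illustration, here is a minimal sketch of that pattern. It assumes the query job has already finished; the job ID string, the chunk size of 1000, and the print call are placeholders to replace with your own job ID, page size, and row handling.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Assumes the query job has already completed; replace "your-job-id"
    # with a real job ID (e.g. query_job.job_id from the question).
    job = client.get_job("your-job-id")

    # Query results live in the job's destination table (an anonymous
    # table for ordinary queries). Fetch it so list_rows has the schema.
    table = client.get_table(job.destination)

    page_token = None
    while True:
        # Each call issues its own request, capped at 1000 rows and
        # resuming from the previous page's token.
        rows = client.list_rows(table, max_results=1000, page_token=page_token)
        for row in rows:
            print(row)  # placeholder for real per-row processing

        # next_page_token is None once the last page has been read.
        page_token = rows.next_page_token
        if page_token is None:
            break

Since each list_rows call is an independent request, nothing from the previous iterator needs to be kept alive between chunks; only the destination table reference and the page token have to be carried over, which also makes it easy to resume paging later.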