google-bigquery, google-cloud-functions, google-cloud-run

Increase BigQuery batch size when calling remote function


I am using GCloud's remote functions via BigQuery to call my API for each row in a table.

Ultimately, these requests hit my API with a single row of data, which is way too small.

I see that the size limit for a request is 10 MB (source), so I can't figure out why BigQuery is deciding that each batch should contain only a single row.

Google's docs reference a way to limit the maximum number of rows in an HTTP request, but I can't see any config to increase the minimum:

You can specify max_batching_rows in OPTIONS as the maximum number of rows in each HTTP request, to avoid Cloud Functions timeout. If not specified, BigQuery will decide how many rows are included in a batch.
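For reference, here is a minimal sketch of where that option lives in the function definition (the connection name and endpoint below are placeholders, not my real values):

    CREATE OR REPLACE FUNCTION project.dataset.remote_fn_example(input JSON)
    RETURNS JSON
    REMOTE WITH CONNECTION `project.us.my_connection`
    OPTIONS (
      -- Placeholder endpoint.
      endpoint = 'https://example-endpoint.a.run.app',
      -- max_batching_rows only caps the batch size;
      -- there is no corresponding option to set a minimum.
      max_batching_rows = 100
    )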

Is there any config I can change to get BigQuery to use more than a single row per batch when calling a remote function?


Solution

  • I don't know if this will solve your particular issue, but I have personally noticed similar behavior with BigQuery. In particular, a query of the form

    CREATE OR REPLACE TABLE project.dataset.table AS (
      SELECT
        STRING(
          JSON_EXTRACT(
            project.dataset.remote_fn_example(
              TO_JSON(ARRAY[TO_JSON(Example)])
            ),
            "$[0]"
          )
        ) AS example_processed
      FROM (
        SELECT *
        FROM project.dataset.other_table
        ORDER BY id
      )
    )
    

    BigQuery will send batches of rows.

    However, if you omit the ORDER BY

    CREATE OR REPLACE TABLE project.dataset.table AS (
      SELECT
        STRING(
          JSON_EXTRACT(
            project.dataset.remote_fn_example(
              TO_JSON(ARRAY[TO_JSON(Example)])
            ),
            "$[0]"
          )
        ) AS example_processed
      FROM (
        SELECT *
        FROM project.dataset.other_table
      )
    )
    

    BigQuery seems to just send one row at a time.

    I'm not sure what causes this behavior, but adding an ORDER BY clause seems to mitigate the issue for me at the time of writing (I went with ORDER BY RAND(); see the sketch below).
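    For concreteness, the inner subquery I ended up using looks like this (same placeholder table name as in the examples above):

    SELECT *
    FROM project.dataset.other_table
    ORDER BY RAND()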

    Of course, as the docs you linked to mention, BigQuery ultimately gets to choose how many rows it will send up to that max_batching_rows, so there are no guarantees that this workaround will continue to work, or even that it will work for you. But I'm posting it anyway in case it is helpful to someone else.