Search code examples
sqlgoogle-cloud-platformgoogle-bigquerysubquery

Why does having a subquery inside of an IN clause triggers a full scan while using an array doesn't?


I have two BigQuery tables A (~100mb) and B (~69gb). Table A is partitioned by a date field and has multiple fields. Table B is not partitioned and only has one date field. I want to delete all records on table A that are within the partition dates on table B.

Using a subquery inside my delete where clause triggers a full scan of table B:

DELETE
FROM `table_B`
WHERE cdate IN (SELECT DISTINCT (cdate) FROM `table_A`)

In my BigQuery case, this processes ~69gb: https://i.sstatic.net/nvLQD.png

However, if I write all dates as an array instead of using the subquery, then it only reads the specified partitions:

DELETE
FROM `table_B`
WHERE cdate IN ("2022-04-01","2022-04-02", "2022-04-03", ...)

In my BigQuery case, this processes ~12.5gb: https://i.sstatic.net/xCSzu.png

I understand the subquery having the overhead of fetching the records from table A, but the 69gb of processing leads me to believe a full scan is happening on table B.

I've tried a couple of different things such as using a temporary table and a WITH statement, but same result. I don't want to resort to writing a procedural statement and manually constructing the array, that would be quite cumbersome, but I'm running out of options.

How is the the subquery evaluated? Is there a different way to accomplish what's happening on the second query that I'm missing?


Solution

  • Okay, so, as expected, writing a dynamic query with BQ procedural language does work:

    EXECUTE IMMEDIATE FORMAT("""
      SELECT * FROM `table_B` WHERE cdate IN ( '%s' )
    """, (SELECT STRING_AGG(DISTINCT CAST(cdate AS STRING), "','") FROM `table_A`))
    

    This in my BigQuery case processes 2kb to build the statement, and 12.5gb for the actual query, the same processing as typing in the dates in an array. This solves my issue, but I'd still like to understand how BigQuery evaluates the subquery, and if this is actually the only way to do it, as I find it far from elegant.