Search code examples
sqlstored-proceduresgoogle-bigqueryparameterstemp-tables

BigQuery stored procedure: unexpected identifier error when using declared variable in LIMIT clause


I'm using BigQuery and trying to use declared variables in my stored procedure. In the snippet below, I'm able to use _error_rate inside my CTEs, but I am not able to use _total_records at the bottom to limit the size of the temporary table.

The error I get from BigQuery is:

Syntax error: Unexpected identifier "_total_records" at [19:16]

CREATE OR REPLACE PROCEDURE my_dataset.my_procedure()
BEGIN
    DECLARE _error_rate DECIMAL DEFAULT 0.05;
    DECLARE _total_records INT64 DEFAULT 100;

    CREATE TEMPORARY TABLE temp_final_rows_1 AS
        (WITH cte_final_rows AS
                  (SELECT CASE
                              WHEN RAND() < _error_rate
                                  THEN FLOOR(RAND() * 1000000000)
                              WHEN RAND() < _error_rate THEN NULL
                              ELSE `NPI` END AS `NPI`,
                          -- etc.
                   FROM my_dataset.v_base_rows)

         SELECT *
         FROM cte_final_rows
         ORDER BY RAND()
         LIMIT _total_records); -- Syntax error: Unexpected identifier "_total_records"

    -- etc.
END;

I tried replacing _total_records with a literal value 100 and it worked. so I know it's okay with _error_rate in the CTE because the procedure works as expected.

I can't find anywhere in the BigQuery documentation that explains why _total_records would be out of scope here. I expect either both variables to be in scope, or both out of scope. I'm very confused why one is accessible and the other is not. Is this a BigQuery bug or is this a quirk of all SQL dialects? I'm having trouble googling for solutions or explanations. Any help is appreciated.

Previously, I had the CREATE TEMPORARY TABLE statement inside an EXECUTE IMMEDIATE """ ... """ string and had no trouble injecting _total_records by using string concatenation. I really don't want to do that though because it's very messy and my IDE doesn't offer any help with queries as big strings...


Solution

  • BigQuery's docs say

    Query parameters can be used as substitutes for arbitrary expressions. Parameters cannot be used as substitutes for identifiers, column names, table names, or other parts of the query.

    It is not clear what the "other parts of the query" are. Apparently LIMIT is in that group. We found a workaround that uses ROW_NUMBER() and a WHERE instead of LIMIT.

    CREATE OR REPLACE PROCEDURE my_dataset.my_procedure()
    BEGIN
        DECLARE _error_rate DECIMAL DEFAULT 0.05;
        DECLARE _total_records INT64 DEFAULT 100;
    
        CREATE TEMPORARY TABLE temp_final_rows_1 AS
            (WITH cte_final_rows AS
                      (SELECT ROW_NUMBER() OVER(ORDER BY RAND()) as row_num,
                              CASE
                                  WHEN RAND() < _error_rate
                                      THEN FLOOR(RAND() * 1000000000)
                                  WHEN RAND() < _error_rate THEN NULL
                                  ELSE `NPI` END AS `NPI`,
                              -- etc.
                       FROM my_dataset.v_base_rows)
    
             SELECT * EXCEPT (row_num)
             FROM cte_final_rows
             WHERE row_num <= _total_records); 
    
        -- etc.
    END;