Search code examples
sqlgoogle-bigquerypartitioningrankingin-operator

Google Bigquery inconsistent when variable names changes in ORDER BY clause


My goal is to test if the grp's generated by one query, are the same grp's as the output of the same query. However, when I change a single variable name, I get different results.

Below I show an example of the same query where we know the results are the same. However, if you run this group, you will find one query produces different results than another.

SELECT grp
FROM
(
  SELECT CONCAT(word, corpus) AS grp, rank1, rank2 
  FROM (
    SELECT
      word, corpus,
      ROW_NUMBER() OVER (PARTITION BY word ORDER BY test1 DESC) AS rank1,
      ROW_NUMBER() OVER (PARTITION BY word ORDER BY word_count DESC) AS rank2,
      ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus DESC) AS rank3,
      ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus_date DESC) AS rank4
    FROM 
    (
      SELECT *, (word_count * word_count * corpus_date) AS test1
      FROM [bigquery-public-data:samples.shakespeare]
    )
  )
)
WHERE rank1 <= 3 OR rank2 <= 3
HAVING grp NOT IN 
(
  SELECT grp FROM (
    SELECT CONCAT(word, corpus) AS grp, rank1, rank2
    FROM
    (
      SELECT
        word, corpus,
        ROW_NUMBER() OVER (PARTITION BY word ORDER BY test2 DESC) AS rank1,
        ROW_NUMBER() OVER (PARTITION BY word ORDER BY word_count DESC) AS rank2,
        ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus DESC) AS rank3,
        ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus_date DESC) AS rank4
      FROM 
      (
        SELECT *, (word_count * word_count * corpus_date) AS test2
        FROM [bigquery-public-data:samples.shakespeare]
      )
    )
  )
  WHERE rank1 <= 3 OR rank2 <= 3
)

Far worse... now if you try running the exact same query, but simply change the variable name test1 to test3, you will get completely different results.

SELECT grp
FROM
(
  SELECT CONCAT(word, corpus) AS grp, rank1, rank2 
  FROM (
    SELECT
      word, corpus,
      ROW_NUMBER() OVER (PARTITION BY word ORDER BY test3 DESC) AS rank1,
      ROW_NUMBER() OVER (PARTITION BY word ORDER BY word_count DESC) AS rank2,
      ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus DESC) AS rank3,
      ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus_date DESC) AS rank4
    FROM 
    (
      SELECT *, (word_count * word_count * corpus_date) AS test3
      FROM [bigquery-public-data:samples.shakespeare]
    )
  )
)
WHERE rank1 <= 3 OR rank2 <= 3
HAVING grp NOT IN 
(
  SELECT grp FROM (
    SELECT CONCAT(word, corpus) AS grp, rank1, rank2
    FROM
    (
      SELECT
        word, corpus,
        ROW_NUMBER() OVER (PARTITION BY word ORDER BY test2 DESC) AS rank1,
        ROW_NUMBER() OVER (PARTITION BY word ORDER BY word_count DESC) AS rank2,
        ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus DESC) AS rank3,
        ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus_date DESC) AS rank4
      FROM 
      (
        SELECT *, (word_count * word_count * corpus_date) AS test2
        FROM [bigquery-public-data:samples.shakespeare]
      )
    )
  )
  WHERE rank1 <= 3 OR rank2 <= 3
)

I can think of no explanation that satisfies both of these bizarre behaviors and this is preventing me from being able to validate my data. Any ideas?

EDIT:

I've updated the BigQuery SQL in the way the responses would suggest, and the same inconsistencies occur.


Solution

  • The problem is nondeterminism in your row numbering.

    There are many examples in this table where (word_count * word_count * corpus_date) is the same for several corpuses. So when you partition by word and order by test2, the ordering you use for assigning row numbers is nondeterministic.

    When you run the same subquery twice within the same top-level query, BigQuery actually executes that subquery twice and may yield different results between the two runs due to that nondeterminism.

    Changing the alias might have just caused your query to not hit in the cache, resulting in a different set of nondeterministic choices and different amount of overlap between the results.

    You can confirm this by changing the ORDER BY clause in your analytic functions to include corpus. For example, change ORDER BY test2 to ORDER BY test2, corpus. Then the row numbering will be deterministic, and the queries will return zero results regardless of what aliases you use.