Search code examples
sqlgroup-bygoogle-bigquery

Why does GROUP BY give me too high a row count in BigQuery


I believe I am missing something (probably quite simple) in the use of GROUP BY in BigQuery, and I am hoping someone can set me straight.

Comparing these two queries I get different numbers of users

SELECT SUM(users) FROM (
    SELECT
        DATE,
        COUNT(DISTINCT user_id) AS users,
      FROM
        `mytable`
      WHERE
        DATE BETWEEN ('2022-05-01') AND ('2022-05-31')        
      GROUP BY
        DATE
)

value for users approx: 140000

SELECT
   COUNT(DISTINCT user_id) AS users,
    FROM
      `mytable`
    WHERE
      DATE BETWEEN ('2022-05-01') AND ('2022-05-31')

value for users approx: 120000


Solution

  • In the second query you're counting the distinct user_id values in the entire date range. In the first query you're counting the distinct user_id values for each day in the range, then summing those. There are probably duplicate users being counted on different days in the first query.