I believe I am missing something (probably quite simple) in the use of GROUP BY in BigQuery, and I am hoping someone can set me straight.
Comparing these two queries I get different numbers of users
SELECT SUM(users) FROM (
SELECT
DATE,
COUNT(DISTINCT user_id) AS users,
FROM
`mytable`
WHERE
DATE BETWEEN ('2022-05-01') AND ('2022-05-31')
GROUP BY
DATE
)
value for users approx: 140000
SELECT
COUNT(DISTINCT user_id) AS users,
FROM
`mytable`
WHERE
DATE BETWEEN ('2022-05-01') AND ('2022-05-31')
value for users approx: 120000
In the second query you're counting the distinct user_id values in the entire date range. In the first query you're counting the distinct user_id values for each day in the range, then summing those. There are probably duplicate users being counted on different days in the first query.