Please note: this question is about PostgreSQL and pgAdmin4, so it is not a duplicate of Using COUNT returns different results, which is about MySQL.
I have a ~4,000,000 records table. The following SQL query:
SELECT COUNT(*) FROM (SELECT * FROM log LIMIT 40000)a WHERE some_column = true
returns a different value each time I execute it. What is the problem and how could I fix it?
This is your query:
SELECT COUNT(*)
FROM (SELECT * FROM log LIMIT 40000)a
WHERE some_column = true
The subquery is returning an arbitrary set of 40,000 rows. This set is different each time you execute the query. If you want a canonical set, you need ORDER BY
with unique sorting keys.
You are filtering afterwards, so the number that match your condition is arbitrary.
If you want an arbitrary set of 40,000 rows where the condition is true, simply do:
SELECT COUNT(*)
FROM (SELECT l.*
FROM log
WHERE some_column = true
LIMIT 40000
) ;
If you want a random set of 40,000 rows, you can do:
SELECT COUNT(*)
FROM (SELECT l.*
FROM log
WHERE some_column = true
ORDER BY random()
LIMIT 40000
) ;
Arbitrary <> random. (Note: this is expensive; there are cheaper ways to get a random set.).
If you want a repeatable sample, well, there are various methods to do that. One is to sort by something like a unique id and take the first 40,000 rows:
SELECT COUNT(*)
FROM (SELECT l.*
FROM log l
WHERE some_column = true
ORDER BY primary_key
LIMIT 40000
) ;
This is not a random sample, just a repeatable sample.