I'm currently stuck on figuring out how to efficiently calculate the number of active users on any given "page" on an app. I'm using PostgreSQL and have a table like this:
CREATE TABLE user_is_viewing_page (
user_id BIGINT,
page_id BIGINT,
timestamp TIMESTAMP
);
Each user is going to POST to a server every 10 seconds or so with the id of the page they are viewing and a new row will be inserted into the database.
The current query I have to count the number of "active" users on any given page is:
SELECT COUNT(DISTINCT user_id)
FROM user_is_viewing_page WHERE page_id = 1
AND timestamp > CURRENT_TIMESTAMP - INTERVAL '10 seconds';
I'm wondering what the most efficient way to use this query is. Keep in mind that I need to access this count fairly often (every 5-10 seconds per page).
A materialized view is not going to help, since your query needs to be based on the latest data and you would have to refresh the MV just as often as you query.
Trigger-based solutions would be another option: keep an auxiliary table up to date with current counts per page. But I would expect the additional cost on (your many) write operations to be much higher than the gain for the read operation. So I would rule this out, too.
While you operate with one big table I suggest a partial index:
CREATE INDEX foo ON user_is_viewing_page (page_id, timestamp)
WHERE timestamp > '2014-12-29 23:30:00'::timestamp; -- start with 'now'
Query (mostly what you had already):
SELECT COUNT(DISTINCT user_id)
FROM user_is_viewing_page
WHERE page_id = 1
AND timestamp > LOCALTIMESTAMP - INTERVAL '10 sec';
CURRENT_TIMESTAMP
would work, too. But LOCALTIMESTAMP
makes more sense for your setup. Per documentation:
CURRENT_TIME
andCURRENT_TIMESTAMP
deliver values with time zone;LOCALTIME
andLOCALTIMESTAMP
deliver values without time zone.
The look-up on the partial index itself has basically the same cost as using a full index. But since your table is supposedly big, the partial index should be much smaller than a full index, which will much rather fit and stay in RAM and be generally faster. If you have more than enough RAM, compare performance with a simple, big full index without WHERE
condition.
The advantage of the partial index obviously deteriorates over time. Create a new index with an updated timestamp in the WHERE
condition at intervals of your choosing and drop the old index after that. Queries will start the new (smaller) index immediately, so the old one can easily be dropped. Possible ways to automate are outlined in these related answers with more explanation:
You may need to add the exact WHERE
condition of the index to your queries (though seemingly redundant) to convince the query planner it's safe to use the partial index. Especially with prepared statements (including all statements in plpgsql functions) where the actual timestamp to compare to is parametrized, else Postgres cannot use the partial index for a generic query plan.
In the example above you would add the WHERE
condition to the query:
AND timestamp > '2014-12-29 23:30:00'::timestamp -- matches index condition exactly
A more general solution for this can be found in the linked answer above.
Aside: I wouldn't use "timestamp" as identifier because it's a basic type name.