I've been working on this personal project for a while and I'm racking my brain over this one problem.
Every 20 minutes, my cron grabs data from posts in a subreddit. The data includes details like upvote count, upvote ratio, etc. I also have a column for the time that the data was collected so I can find the most current data for each submission.
The current data is displayed on a webpage for that submission, and the historical data for that submission is displayed as charts and graphs on that page, so I need both sets.
Some sample columns for table structure:
reddit_id | retrieval_time | total_upvotes | upvote_ratio | num_comments
The table contains multiple reddit_ids each with their own individual data (total_upvotes, upvote_ratio) at a certain point in time (retrieval_time).
My initial reflex is to do a query for all the reddit_ids, then iterate over that list and run a query for each id like below:
SELECT reddit_id FROM nstats;
Then iterate over the query results and run this query for each one:
SELECT * FROM nstats WHERE reddit_id='9hdajv' ORDER BY retrieval_time DESC;
This method tends to run pretty slow though, at least in my testing. I'm using Postgres with Peewee and Python for database access.
Do I need to restructure the way I keep my data? Am I querying the right way and I just have to deal with the slow speeds? What should a single query look like to grab the most recent data for each submission?
Adding an example output to clarify. This is what the raw table would look like: multiple reddit_ids each with their own data.
reddit_id | retrieval_time | total_upvotes | upvote_ratio | num_comments
9hdajv | 1537471220 | 15 | .78 | 3
9hdajv | 1537298420 | 7 | .98 | 0
9hdajv | 1537384820 | 10 | .86 | 1
7klmnq | 1534706420 | 25 | .86 | 1
7klmnq | 1534965620 | 35 | .75 | 5
3poslo | 1535311220 | 16 | .75 | 3
3poslo | 1535397620 | 20 | .83 | 6
3poslo | 1535570420 | 37 | .85 | 9
This would be the output, 1 reddit_id each with its most current data.
reddit_id | retrieval_time | total_upvotes | upvote_ratio | num_comments
9hdajv | 1537471220 | 15 | .78 | 3
7klmnq | 1534965620 | 35 | .75 | 5
3poslo | 1535570420 | 37 | .85 | 9
This query comes close to the expected output.
Query
SELECT
Table1."reddit_id"
, Table1."retrieval_time"
, Table1."total_upvotes"
, Table1."upvote_ratio"
, Table1."num_comments"
FROM (
SELECT
Table1."reddit_id"
, Table1."retrieval_time"
, Table1."total_upvotes"
, Table1."upvote_ratio"
, Table1."num_comments"
, ROW_NUMBER() OVER (PARTITION BY Table1."reddit_id" ORDER BY Table1."retrieval_time" DESC) AS row_num
FROM
Table1
) AS Table1
WHERE
Table1.row_num = 1
If you add
ORDER BY
Table1.num_comments ASC
the output would be the same. if you care about the order.
Result
| reddit_id | retrieval_time | total_upvotes | upvote_ratio | num_comments |
| --------- | -------------- | ------------- | ------------ | ------------ |
| 3poslo | 1535570420 | 37 | 0.85 | 9 |
| 7klmnq | 1534965620 | 35 | 0.75 | 5 |
| 9hdajv | 1537471220 | 15 | 0.78 | 3 |
What if reddit_id = 9hdajv has multiple records with the same retrieval_time = 1537471220 do they need to be also displayed? – Raymond Nijland
No, just choose one of them if that slips through. That shouldn't happen in my collection code, and if it does, it doesn't matter to me which one is selected – jarcobi889
It also handle ties like requested.