Search code examples
postgresqlpython-3.5peewee

Querying Current Data from Historical Data


I've been working on this personal project for a while and I'm racking my brain over this one problem.

Every 20 minutes, my cron grabs data from posts in a subreddit. The data includes details like upvote count, upvote ratio, etc. I also have a column for the time that the data was collected so I can find the most current data for each submission.

The current data is displayed on a webpage for that submission, and the historical data for that submission is displayed as charts and graphs on that page, so I need both sets.

Some sample columns for table structure:

reddit_id | retrieval_time | total_upvotes | upvote_ratio | num_comments

The table contains multiple reddit_ids each with their own individual data (total_upvotes, upvote_ratio) at a certain point in time (retrieval_time).

My initial reflex is to do a query for all the reddit_ids, then iterate over that list and run a query for each id like below:

SELECT reddit_id FROM nstats;

Then iterate over the query results and run this query for each one:

SELECT * FROM nstats WHERE reddit_id='9hdajv' ORDER BY retrieval_time DESC;

This method tends to run pretty slow though, at least in my testing. I'm using Postgres with Peewee and Python for database access.

Do I need to restructure the way I keep my data? Am I querying the right way and I just have to deal with the slow speeds? What should a single query look like to grab the most recent data for each submission?

Adding an example output to clarify. This is what the raw table would look like: multiple reddit_ids each with their own data.

reddit_id | retrieval_time | total_upvotes | upvote_ratio | num_comments
9hdajv | 1537471220 | 15 | .78 | 3
9hdajv | 1537298420 | 7 | .98 | 0
9hdajv | 1537384820 | 10 | .86 | 1
7klmnq | 1534706420 | 25 | .86 | 1
7klmnq | 1534965620 | 35 | .75 | 5
3poslo | 1535311220 | 16 | .75 | 3
3poslo | 1535397620 | 20 | .83 | 6
3poslo | 1535570420 | 37 | .85 | 9

This would be the output, 1 reddit_id each with its most current data.

reddit_id | retrieval_time | total_upvotes | upvote_ratio | num_comments
9hdajv | 1537471220 | 15 | .78 | 3
7klmnq | 1534965620 | 35 | .75 | 5
3poslo | 1535570420 | 37 | .85 | 9

Solution

  • This query comes close to the expected output.

    Query

    SELECT 
       Table1."reddit_id"
     , Table1."retrieval_time"
     , Table1."total_upvotes"
     , Table1."upvote_ratio"
     , Table1."num_comments" 
    FROM (
    
      SELECT 
         Table1."reddit_id"
       , Table1."retrieval_time"
       , Table1."total_upvotes"
       , Table1."upvote_ratio"
       , Table1."num_comments" 
       , ROW_NUMBER() OVER (PARTITION BY  Table1."reddit_id" ORDER BY Table1."retrieval_time" DESC) AS row_num
      FROM 
       Table1
    
    ) AS Table1
    WHERE
     Table1.row_num = 1
    

    If you add

    ORDER BY Table1.num_comments ASC

    the output would be the same. if you care about the order.

    Result

    | reddit_id | retrieval_time | total_upvotes | upvote_ratio | num_comments |
    | --------- | -------------- | ------------- | ------------ | ------------ |
    | 3poslo    | 1535570420     | 37            | 0.85         | 9            |
    | 7klmnq    | 1534965620     | 35            | 0.75         | 5            |
    | 9hdajv    | 1537471220     | 15            | 0.78         | 3            |
    

    Demo on DB Fiddle

    What if reddit_id = 9hdajv has multiple records with the same retrieval_time = 1537471220 do they need to be also displayed? – Raymond Nijland

    No, just choose one of them if that slips through. That shouldn't happen in my collection code, and if it does, it doesn't matter to me which one is selected – jarcobi889

    It also handle ties like requested.

    Demo on DB Fiddle