Search code examples
sqlduplicatesamazon-redshiftrankingrank

How do I get the first row based on three columns, an id, document number and a timestamp, during a specific time?


I'm trying to find a solution for a case at work that basically is:

We have a database that registers all the api requests and responses. We need to match the number of requests that are charged by the api owner and the number on the database.

We have an id column, but it's not unique, a document number column and the timestamp of when the request was made.

The tricky part is: If we make a request for an specific document, it'll charge for that request, but in the period of 24 hours, if we make the same request for that document, it won't be charged. After that 24h period from the first request, if we make a new request, it'll count as the "first one" again will be charged, so another 24h period starts again and new requests for that document won't be charged during that window frame.

Example:

Id Document Number Timestamp
id1 doc1 2024-01-01 00:00:00
id2 doc2 2024-01-01 00:00:00
id3 doc1 2024-01-01 04:30:00
id4 doc1 2024-01-02 01:00:00

In the example, doc 1 was called 3 times. The first one (id1) will be charged and starts the 24h window span, the second (id2) will also be charge since the doc is not the same. The third one (id3) WON'T be charged since it's still on the window span, but the fourth (id4) will be charged again since it's after the 24h from id1.

I'm trying to think on how to do that, maybe using ranking to sort it by date and document, but since we have that 24h rule, I've not been able to find a solution.


Solution

  • Here's the solution I came up with:

    1. Assign the records a Row Number based on the order of the DOC_NUM, REQUEST_DATETIME and Id columns.
    2. Use a RECURSIVE CTE to calculate the hours between each document request and add those hours to a running total (RS_HOURS_LAST_REQUEST column).
    3. If the RS_HOURS_LAST_REQUEST exceeds 24 hours then reset the running total back to zero.

    The RS_HOURS_LAST_REQUEST column will return zero if the row is the first time the document was ever requested, or if 24 hours have passed since the document was first requested.

    Important Notes

    • I added a few other data rows in order to further test and simulate a few conditions.
    • I highly recommend commenting out the WHERE a.RS_HOURS_LAST_REQUEST = 0 and analyze the output to further understand how this is working. You can un-comment when you want the final result.

    Schema (PostgreSQL v15)

    CREATE TABLE api_doc_requests (
      id text,
      doc_num text,
      request_datetime timestamp
    );
    INSERT INTO api_doc_requests (id, doc_num, request_datetime) VALUES ('id1','doc1','2024-01-01 00:00:00');
    INSERT INTO api_doc_requests (id, doc_num, request_datetime) VALUES ('id2','doc2','2024-01-01 00:00:00');
    INSERT INTO api_doc_requests (id, doc_num, request_datetime) VALUES ('id3','doc1','2024-01-01 04:30:00');
    INSERT INTO api_doc_requests (id, doc_num, request_datetime) VALUES ('id4','doc1','2024-01-02 01:00:00');
    INSERT INTO api_doc_requests (id, doc_num, request_datetime) VALUES ('id5','doc1','2024-01-01 06:00:00');
    INSERT INTO api_doc_requests (id, doc_num, request_datetime) VALUES ('id6','doc1','2024-01-01 10:30:00');
    INSERT INTO api_doc_requests (id, doc_num, request_datetime) VALUES ('id7','doc2','2024-01-01 03:30:00');
    INSERT INTO api_doc_requests (id, doc_num, request_datetime) VALUES ('id8','doc2','2024-01-01 11:30:00');
    INSERT INTO api_doc_requests (id, doc_num, request_datetime) VALUES ('id9','doc1','2024-01-02 06:00:00');
    INSERT INTO api_doc_requests (id, doc_num, request_datetime) VALUES ('id10','doc1','2024-01-02 10:30:00');
    

    Query #1

    WITH RECURSIVE CTE_DOC_REQUESTS AS
      (SELECT A.*
       FROM
         (SELECT A.ID,
                 A.DOC_NUM,
                 A.REQUEST_DATETIME,
                 ROW_NUMBER() OVER (PARTITION BY A.DOC_NUM
                                    ORDER BY A.REQUEST_DATETIME,
                                             A.ID) AS ROW_NUM,
                                   ROUND(0.00, 2) AS HOURS_LAST_REQUEST,
                                   ROUND(0.00, 2) AS RS_HOURS_LAST_REQUEST
          FROM API_DOC_REQUESTS A) A
       WHERE A.ROW_NUM = 1
       UNION SELECT A.ID,
                    A.DOC_NUM,
                    A.REQUEST_DATETIME,
                    A.ROW_NUM,
                    ROUND(EXTRACT(EPOCH
                                  FROM A.REQUEST_DATETIME - B.REQUEST_DATETIME) / 3600, 2) AS HOURS_LAST_REQUEST,
                    CASE
                    
                        WHEN ROUND(EXTRACT(EPOCH
                                           FROM A.REQUEST_DATETIME - B.REQUEST_DATETIME) / 3600, 2) + B.RS_HOURS_LAST_REQUEST >= 24 THEN 0.00
                    
                        ELSE ROUND(EXTRACT(EPOCH
                                           FROM A.REQUEST_DATETIME - B.REQUEST_DATETIME) / 3600, 2) + B.RS_HOURS_LAST_REQUEST
                    END AS RS_HOURS_LAST_REQUEST
       FROM
         (SELECT A.ID,
                 A.DOC_NUM,
                 A.REQUEST_DATETIME,
                 ROW_NUMBER() OVER (PARTITION BY A.DOC_NUM
                                    ORDER BY A.REQUEST_DATETIME,
                                             A.ID) AS ROW_NUM
          FROM API_DOC_REQUESTS A) A
       INNER JOIN CTE_DOC_REQUESTS B ON A.DOC_NUM = B.DOC_NUM
       AND A.ROW_NUM = B.ROW_NUM + 1)
    SELECT *
    FROM CTE_DOC_REQUESTS A
    
    WHERE A.RS_HOURS_LAST_REQUEST = 0
    ORDER BY A.DOC_NUM,
             A.REQUEST_DATETIME,
             A.ID;
    
    id doc_num request_datetime row_num hours_last_request rs_hours_last_request
    id1 doc1 2024-01-01T00:00:00.000Z 1 0.00 0.00
    id4 doc1 2024-01-02T01:00:00.000Z 5 14.50 0.00
    id2 doc2 2024-01-01T00:00:00.000Z 1 0.00 0.00

    View on DB Fiddle