Search code examples
sqlpostgresqlgreatest-n-per-group

Query Value by Max Date in Postgresql


I already asked this question here but there contained less information about my question. So, I create a new question with more information.

Here is the sample table that I have. Each row contains the filled data by the user at every time. So that the timestamp column will not be null through the whole table. There may be unrecorded value under item, if the user didn't fill. The id is the auto-generated column for each record.

CREATE TABLE tbl (id int, customer_id text, item text, value text, timestamp timestamp);    
INSERT INTO tbl VALUES
(1, '001', 'price', '1000', '2021-11-01 01:00:00'),
(2, '001', 'price', '1500', '2021-11-02 01:00:00'),
(3, '001', 'price', '1400', '2021-11-03 01:00:00'),
(4, '001', 'condition', 'good', '2021-11-01 01:00:00'),
(5, '001', 'condition', 'good', '2021-11-02 01:00:00'),
(6, '001', 'condition', 'ok', '2021-11-03 01:00:00'),
(7, '001', 'feeling', 'sad', '2021-11-01 01:00:00'),
(8, '001', 'feeling', 'angry', '2021-11-02 01:00:00'),
(9, '001', 'feeling', 'fine', '2021-11-03 01:00:00'),
(10, '002', 'price', '1200', '2021-11-01 01:00:00'),
(11, '002', 'price', '1600', '2021-11-02 01:00:00'),
(12, '002', 'price', '2000', '2021-11-03 01:00:00'),
(13, '002', 'weather', 'sunny', '2021-11-01 01:00:00'),
(14, '002', 'weather', 'rain', '2021-11-02 01:00:00'),
(15, '002', 'price', '1900', '2021-11-04 01:00:00'),
(16, '002', 'feeling', 'sad', '2021-11-01 01:00:00'),
(17, '002', 'feeling', 'angry', '2021-11-02 01:00:00'),
(18, '002', 'feeling', 'fine', '2021-11-03 01:00:00'),
(19, '003', 'price', '1000', '2021-11-01 01:00:00'),
(20, '003', 'price', '1500', '2021-11-02 01:00:00'),
(21, '003', 'price', '2000', '2021-11-03 01:00:00'),
(22, '003', 'condition', 'ok', '2021-11-01 01:00:00'),
(23, '003', 'weather', 'rain', '2021-11-02 01:00:00'),
(24, '003', 'condition', 'bad', '2021-11-03 01:00:00'),
(25, '003', 'feeling', 'fine', '2021-11-01 01:00:00'),
(26, '003', 'weather', 'sunny', '2021-11-03 01:00:00'),
(27, '003', 'feeling', 'sad', '2021-11-03 01:00:00')
;

To see clearly, I order the above table by id and timestamp. It doesn't matter.

  • We are using Postgresql Version: PostgreSQL 9.5.19
  • The actual table contains over 4 million rows
  • The item column contains over 500 distinct items, but don't worry. I will use 10 items at most for a query. In the above table, I used only 4 items.
  • We also have another table called Customer_table with a unique Customer_id containing customers' general information.

From the above table, I want to query the data to create a table with the latest date updated data as below. I will use 10 items at most for a query so that there may be 10 columns.

customer_id  price  condition  feeling   weather .......(there may be other columns from item column)
   002        1900    null      fine      rain
   001        1400     ok       fine      null
   003        2000    bad       sad       sunny

This is the query that I get from previous questions, but I asked only for two item.

SELECT customer_id, p.value AS price, c.value AS condition
FROM  (
   SELECT DISTINCT ON (customer_id)
          customer_id, value
   FROM   tbl
   WHERE  item = 'condition'
   ORDER  BY customer_id, timestamp DESC
   ) c
FULL JOIN (
   SELECT DISTINCT ON (customer_id)
          customer_id, value
   FROM   tbl
   WHERE  item = 'price'
   ORDER  BY customer_id, timestamp DESC
   ) p USING (customer_id)

So, if there is any better solution please help me. Thank you.


Solution

  • You may try other approaches using row_number to generate a value to filter your data on the most recent data. You may then aggregate on customer id with the max value for a case expression filtering your records based on the desired row number rn=1 (we will order by descending) and item name.

    These approaches are less verbose and based on the results online seem to be more performant. Let me know how replicating this in your environment works in the comments.

    You may use EXPLAIN ANALYZE to compare this approach to the current one. The results in the online environment provided:

    Current Approach

    | Planning time: 0.129 ms                                                                                                      
    | Execution time: 0.056 ms      
    

    Suggested Approach 1

    | Planning time: 0.061 ms                                                                                                 
    | Execution time: 0.070 ms   
    

    Suggested Approach 2

    | Planning time: 0.047 ms                                                                                                 
    | Execution time: 0.056 ms 
    

    NB. You may use EXPLAIN ANALYZE to compare these approaches in your environment which we cannot replicate online. The results may also vary on each run. Indexes and early filters on the item column are also recommended to improve performance.


    Schema (PostgreSQL v9.5)

    Suggested Approach 1

    SELECT
        t1.customer_id,
        MAX(CASE WHEN t1.item='condition' THEN t1.value END) as conditio,
        MAX(CASE WHEN t1.item='price' THEN t1.value END) as price,
        MAX(CASE WHEN t1.item='feeling' THEN t1.value END) as feeling,
        MAX(CASE WHEN t1.item='weather' THEN t1.value END) as weather
    FROM (
        SELECT
            * ,
            ROW_NUMBER() OVER (
                PARTITION BY customer_id,item
                ORDER BY tbl.timestamp DESC
            ) as rn
        FROM
            tbl 
        -- ensure that you filter based on your desired items
        -- indexes on item column are recommended to improve performance
    ) t1
    WHERE rn=1
    GROUP BY
       1;
    
    customer_id conditio price feeling weather
    001 ok 1400 fine
    002 1900 fine rain
    003 bad 2000 sad sunny

    Suggested Approach 2

    SELECT
        t1.customer_id,
        MAX(t1.value) FILTER (WHERE  t1.item='condition')  as conditio,
        MAX(t1.value) FILTER (WHERE  t1.item='price')  as price,
        MAX(t1.value) FILTER (WHERE  t1.item='feeling')  as feeling,
        MAX(t1.value) FILTER (WHERE  t1.item='weather')  as weather
        
    FROM (
        SELECT
            * ,
            ROW_NUMBER() OVER (
                PARTITION BY customer_id,item
                ORDER BY tbl.timestamp DESC
            ) as rn
        FROM
            tbl 
        -- ensure that you filter based on your desired items
        -- indexes on item column are recommended to improve performance
    ) t1
    WHERE rn=1
    GROUP BY
       1;
    
    customer_id conditio price feeling weather
    001 ok 1400 fine
    002 1900 fine rain
    003 bad 2000 sad sunny

    Current Approach with EXPLAIN ANALYZE

    EXPLAIN(ANALYZE,BUFFERS)
    SELECT customer_id, p.value AS price, c.value AS condition
    FROM  (
       SELECT DISTINCT ON (customer_id)
              customer_id, value
       FROM   tbl
       WHERE  item = 'condition'
       ORDER  BY customer_id, timestamp DESC
       ) c
    FULL JOIN (
       SELECT DISTINCT ON (customer_id)
              customer_id, value
       FROM   tbl
       WHERE  item = 'price'
       ORDER  BY customer_id, timestamp DESC
       ) p USING (customer_id);
    
    QUERY PLAN
    Merge Full Join (cost=35.05..35.12 rows=1 width=128) (actual time=0.025..0.030 rows=3 loops=1)
    Merge Cond: (tbl.customer_id = tbl_1.customer_id)
    Buffers: shared hit=2
    -> Unique (cost=17.52..17.54 rows=1 width=72) (actual time=0.013..0.014 rows=2 loops=1)
    Buffers: shared hit=1
    -> Sort (cost=17.52..17.53 rows=3 width=72) (actual time=0.013..0.013 rows=5 loops=1)
    Sort Key: tbl.customer_id, tbl."timestamp" DESC
    Sort Method: quicksort Memory: 25kB
    Buffers: shared hit=1
    -> Seq Scan on tbl (cost=0.00..17.50 rows=3 width=72) (actual time=0.004..0.006 rows=5 loops=1)
    Filter: (item = 'condition'::text)
    Rows Removed by Filter: 22
    Buffers: shared hit=1
    -> Materialize (cost=17.52..17.55 rows=1 width=64) (actual time=0.010..0.013 rows=3 loops=1)
    Buffers: shared hit=1
    -> Unique (cost=17.52..17.54 rows=1 width=72) (actual time=0.010..0.012 rows=3 loops=1)
    Buffers: shared hit=1
    -> Sort (cost=17.52..17.53 rows=3 width=72) (actual time=0.010..0.010 rows=10 loops=1)
    Sort Key: tbl_1.customer_id, tbl_1."timestamp" DESC
    Sort Method: quicksort Memory: 25kB
    Buffers: shared hit=1
    -> Seq Scan on tbl tbl_1 (cost=0.00..17.50 rows=3 width=72) (actual time=0.001..0.003 rows=10 loops=1)
    Filter: (item = 'price'::text)
    Rows Removed by Filter: 17
    Buffers: shared hit=1
    Planning time: 0.129 ms
    Execution time: 0.056 ms

    Suggested Approach 1 with EXPLAIN ANALYZE

    EXPLAIN(ANALYZE,BUFFERS)
    SELECT
        t1.customer_id,
        MAX(CASE WHEN t1.item='price' THEN t1.value END) as price,
        MAX(CASE WHEN t1.item='condition' THEN t1.value END) as conditio
        
    FROM (
        SELECT
            * ,
            ROW_NUMBER() OVER (
                PARTITION BY customer_id,item
                ORDER BY tbl.timestamp DESC
            ) as rn
        FROM
            tbl 
        where item IN ('price','condition')
    ) t1
    WHERE rn=1
    GROUP BY
       1;
    
    QUERY PLAN
    GroupAggregate (cost=17.58..17.81 rows=1 width=96) (actual time=0.039..0.047 rows=3 loops=1)
    Group Key: t1.customer_id
    Buffers: shared hit=1
    -> Subquery Scan on t1 (cost=17.58..17.79 rows=1 width=96) (actual time=0.030..0.040 rows=5 loops=1)
    Filter: (t1.rn = 1)
    Rows Removed by Filter: 10
    Buffers: shared hit=1
    -> WindowAgg (cost=17.58..17.71 rows=6 width=104) (actual time=0.029..0.038 rows=15 loops=1)
    Buffers: shared hit=1
    -> Sort (cost=17.58..17.59 rows=6 width=104) (actual time=0.028..0.030 rows=15 loops=1)
    Sort Key: tbl.customer_id, tbl.item, tbl."timestamp" DESC
    Sort Method: quicksort Memory: 26kB
    Buffers: shared hit=1
    -> Seq Scan on tbl (cost=0.00..17.50 rows=6 width=104) (actual time=0.003..0.008 rows=15 loops=1)
    Filter: (item = ANY ('{price,condition}'::text[]))
    Rows Removed by Filter: 12
    Buffers: shared hit=1
    Planning time: 0.061 ms
    Execution time: 0.070 ms

    Suggested Approach 2 with EXPLAIN ANALYZE

    EXPLAIN(ANALYZE,BUFFERS)
    SELECT
        t1.customer_id,
        MAX(t1.value) FILTER (WHERE  t1.item='price')  as price,
        MAX(t1.value) FILTER (WHERE  t1.item='condition')  as conditio
        
    FROM (
        SELECT
            * ,
            ROW_NUMBER() OVER (
                PARTITION BY customer_id,item
                ORDER BY tbl.timestamp DESC
            ) as rn
        FROM
            tbl 
        where item IN ('price','condition')
    ) t1
    WHERE rn=1
    GROUP BY
       1;
    
    QUERY PLAN
    GroupAggregate (cost=17.58..17.81 rows=1 width=96) (actual time=0.029..0.037 rows=3 loops=1)
    Group Key: t1.customer_id
    Buffers: shared hit=1
    -> Subquery Scan on t1 (cost=17.58..17.79 rows=1 width=96) (actual time=0.021..0.032 rows=5 loops=1)
    Filter: (t1.rn = 1)
    Rows Removed by Filter: 10
    Buffers: shared hit=1
    -> WindowAgg (cost=17.58..17.71 rows=6 width=104) (actual time=0.021..0.030 rows=15 loops=1)
    Buffers: shared hit=1
    -> Sort (cost=17.58..17.59 rows=6 width=104) (actual time=0.019..0.021 rows=15 loops=1)
    Sort Key: tbl.customer_id, tbl.item, tbl."timestamp" DESC
    Sort Method: quicksort Memory: 26kB
    Buffers: shared hit=1
    -> Seq Scan on tbl (cost=0.00..17.50 rows=6 width=104) (actual time=0.003..0.008 rows=15 loops=1)
    Filter: (item = ANY ('{price,condition}'::text[]))
    Rows Removed by Filter: 12
    Buffers: shared hit=1
    Planning time: 0.047 ms
    Execution time: 0.056 ms

    View working demo on DB Fiddle