Search code examples
sqlpostgresqldistinctgreatest-n-per-groupcrosstab

Combine crosstab function with DISTINCT ON


I have two tables details and data table. I already joined the two tables and the crosstab function is already done.

I want to show only the latest data per serial. Refer to the current and desired output below.

Question: How can I use DISTINCT ON in this crosstab query?

Table details:

serial   | date                 |    line      |   total_judgement
---------+----------------------+--------------+----------------
 123     | 2016/05/21 12:00:00  |      A       |       1
 456     | 2016/05/21 12:02:00  |      A       |       0
 456     | 2016/05/21 12:05:00  |      A       |       0

Table data:

serial   |     date             | readings   |   value
---------+----------------------+------------+-------------
 123     | 2016/05/21 12:00:00  | reading1   |  1.2342
 123     | 2016/05/21 12:00:00  | reading2   |  2.3213
 123     | 2016/05/21 12:00:00  | reading3   |  3.4232
 456     | 2016/05/21 12:00:02  | reading1   |  1.2546
 456     | 2016/05/21 12:00:02  | reading2   |  2.3297
 456     | 2016/05/21 12:00:02  | reading3   |  3.4264
 456     | 2016/05/21 12:00:05  | reading1   |  1.9879
 456     | 2016/05/21 12:00:05  | reading2   |  2.4754
 456     | 2016/05/21 12:00:05  | reading3   |  3.4312

Current Output:

serial   | line |      date            | total_judgement| reading1  |   reading2  |   reading3  
---------+------+----------------------+----------------+-----------+-------------+--------------
123      |  A   |  2016/05/21 12:00:00 |       1        |  1.2342   |   2.3213    |   3.4232      
456      |  A   |  2016/05/21 12:00:02 |       0        |  1.2546   |   2.3297    |   3.4264 
456      |  A   |  2016/05/21 12:00:02 |       0        |  1.9879   |   2.4754    |   3.4312    

Desired Output:

serial   | line |     date             | total_judgement | reading1  |   reading2  |   reading3   
---------+------+----------------------+-----------------+-----------+-------------+--------------
123      |  A   |  2016/05/21 12:00:00 |         1       |  1.2342   |   2.3213    |   3.4232  
456      |  A   |  2016/05/21 12:00:05 |         0       |  1.9879   |   2.4754    |   3.4312 

Here's my Code:

SELECT * FROM crosstab (
            $$ SELECT
                tb2.serial,
                tb1.line,
                tb2.date,
                tb1.total_judgement,
                tb2.readings,
                tb2.value
            FROM
                data tb2
                    INNER JOIN details tb1 ON (tb2.serial = tb1.serial 
             AND   tb2.date = tb1.date)
            ORDER BY tb2.date ASC $$,
            $$ VALUES ('reading1'),('reading2'),('reading3')$$
  ) as ct("S/N" VARCHAR (50),
          "Line" VARCHAR(3),
          "Date" TIMESTAMP,
          "TotalJudgement" CHARACTER(1),
          "Reading1" FLOAT8,
          "Reading2" FLOAT8,
          "Reading3" FLOAT8);

Notes

I need to join the two tables on serial and date.

I think DISTINCT ON may help with this but I don't seem to get correct results when I use DISTINCT ON serial.


Solution

  • Making some assumptions, this is probably what you are looking for:

    SELECT * FROM crosstab (
       $$
       SELECT t2.serial
            , t1.line
            , t2.date
            , t1.total_judgement
            , t2.readings
            , t2.value
       FROM  (SELECT DISTINCT ON (serial) * FROM details ORDER BY serial, date DESC) t1
       JOIN   data t2 USING (serial, date)
       ORDER  BY t2.serial
       $$
     , $$VALUES ('reading1'),('reading2'),('reading3')$$
       ) AS ct("S/N" text
             , "Line" text
             , "Date" timestamp
             , "TotalJudgement" text
             , "Reading1" float8
             , "Reading2" float8
             , "Reading3" float8);

    If you apply DISTINCT ON (serial) after the join, you would only retain a single row from data. Move the DISTINCT step into a subquery on details to get all readings in data for the latest row of each serial in details.

    BTW, DISTINCT and DISTINCT ON ( expression [, ...] ) are not "functions" but SQL constructs. Basics:

    While being at it I made some simplifications to the code. Not essential for the answer.

    If there are many rows per serial in table details, it might be more efficient to use one of these techniques instead of DISTINCT ON:

    I can't be more specific without knowing table definitions, cardinalities etc.