Tags: mysql, amazon-redshift, distributed, query-performance, distinct-values

How to select data based on Multiple Unique Columns without applying aggregate functions on rest of columns in resultset


This might seem like a duplicate, but I could not find an answer that matches my requirements. I referred here, but it's not quite the same and the answers do not apply to my case, hence this question.

SELECT tab1.col1, tab1.col2, tab1.col3, tab2.col3, tab2.col4 
from 
     (SELECT col1, col2, col3 
      FROM table1 
      GROUP BY col1, col2, col3) AS tab1
JOIN tab2 
  ON tab1.col1 = tab2.col1 
 AND tab1.col2 = tab2.col2
GROUP BY tab1.col1, tab1.col2, tab1.col3, tab2.col3, tab2.col4

Example Data:

Table1                          Tab2            
col1  | col2   |   col3         col1  |  col2 | col3  |  col4
=======================         =============================
page1   image1  referer1        page1   image1  150      75
page1   image1  referer1        page1   image1  120      85
page2   image2  referer2        page2   image2  200      400
page1   image1  referer1        page1   image1  750     1024
page2   image2  referer2        page2   image2  450      575
page1   image1  referer1        page1   image1  600      900

Expected Output:

tab1.col1  |  tab1.col2  |  tab1.col3  |  tab2.col3  |   tab2.col4
==================================================================
  page1       image1        referer1        600            900
  page2       image2        referer2        200            400

Here the last GROUP BY returns unique rows, but all the tab1 columns repeat, which I don't want; in this case the records vary only on tab2.col3 and tab2.col4. The requirement is that tab1.col1, tab1.col2, tab1.col3 be unique, with only one pair of the respective tab2.col3 and tab2.col4 values from tab2 for those columns. I cannot remove tab2.col3 and tab2.col4 from the second GROUP BY and apply a MIN or MAX aggregate function to each, because in that case I won't get the tab2.col3 value that belongs to the same record as the tab2.col4 value.
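
To illustrate the pairing problem described above, here is a minimal sketch that applies independent MIN aggregates to the Tab2 sample rows. It assumes Python's built-in sqlite3 module as a local stand-in for Redshift; the function name `independent_minimums` is hypothetical and only for illustration.

```python
import sqlite3

# Sample rows copied from the Tab2 example data in the question.
TAB2_ROWS = [
    ("page1", "image1", 150, 75),
    ("page1", "image1", 120, 85),
    ("page2", "image2", 200, 400),
    ("page1", "image1", 750, 1024),
    ("page2", "image2", 450, 575),
    ("page1", "image1", 600, 900),
]


def independent_minimums():
    """Return MIN(col3) and MIN(col4) per (col1, col2) group.

    Assumption: SQLite (in memory) stands in for Redshift here.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tab2 (col1 TEXT, col2 TEXT, col3 INTEGER, col4 INTEGER)"
    )
    conn.executemany("INSERT INTO tab2 VALUES (?, ?, ?, ?)", TAB2_ROWS)
    rows = conn.execute(
        "SELECT col1, col2, MIN(col3), MIN(col4) "
        "FROM tab2 GROUP BY col1, col2 ORDER BY col1, col2"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in independent_minimums():
        # Flag whether the aggregated pair exists as a real record.
        print(row, "real record" if row in TAB2_ROWS else "no such record")
```

For page1/image1 the two aggregates come from different records (120 from one row, 75 from another), so the resulting pair does not exist in Tab2.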

Note: I am using Amazon Redshift as the DB. The subquery is important since it is actually a complicated result set produced from a 3-table join, and if I join these 3 tables directly with tab2 the query runs forever. For the sake of simplicity, let's assume that the subquery returns col1, col2, col3 from table1. Tab2 and tab1 are ginormous tables :D... the subquery improves performance considerably (reduced 20 mins to ~2 mins).


Solution

  • If the result below is what you need,

    tab1.col1  |  tab1.col2  |  tab1.col3  |  tab2.col3  |   tab2.col4
    ==================================================================
      page1       image1        referer1        120            85
      page2       image2        referer2        200            400
    

    then you can achieve it using the Redshift SQL query below:

    SELECT tab1.col1, tab1.col2, tab1.col3, tab2.col3, tab2.col4
    FROM
      (SELECT col1, col2, col3
       FROM table1
       GROUP BY col1, col2, col3) AS tab1
    JOIN
      (SELECT col1, col2, col3, col4
       FROM
         (SELECT col1, col2, col3, col4, ROW_NUMBER() OVER(PARTITION BY col1, col2 ORDER BY col3 ASC, col4 ASC) row_num
          FROM table2) tab2
       WHERE row_num = 1) tab2 ON tab1.col1 = tab2.col1
          AND tab1.col2 = tab2.col2
    

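    The key in the above SQL statement is "ROW_NUMBER() OVER(PARTITION BY col1, col2 ORDER BY col3 ASC, col4 ASC)". For each (col1, col2) group it keeps the single row with the smallest tab2.col3 (ties broken by the smallest tab2.col4), so col3 and col4 always come from the same record; it does not return the independent minimum of each column. If you need the row with the largest value instead, change the order to DESC.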

    Hope this solves your question.

    You can also check the result set here, which I have written up.
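
As a local check of the ROW_NUMBER() approach, the sketch below runs essentially the same query against the question's sample data. It assumes Python's built-in sqlite3 module (with SQLite 3.25 or newer for window functions) as a stand-in for Redshift. The function name `pick_one_row_per_group` and the aliases `ranked` and `picked` are hypothetical, used only to avoid reusing the `tab2` alias at two nesting levels.

```python
import sqlite3

# Sample rows copied from the question's example data.
TABLE1_ROWS = [
    ("page1", "image1", "referer1"),
    ("page1", "image1", "referer1"),
    ("page2", "image2", "referer2"),
    ("page1", "image1", "referer1"),
    ("page2", "image2", "referer2"),
    ("page1", "image1", "referer1"),
]
TABLE2_ROWS = [
    ("page1", "image1", 150, 75),
    ("page1", "image1", 120, 85),
    ("page2", "image2", 200, 400),
    ("page1", "image1", 750, 1024),
    ("page2", "image2", 450, 575),
    ("page1", "image1", 600, 900),
]

# Same shape as the query in the answer; only the sort direction varies.
QUERY_TEMPLATE = """
SELECT tab1.col1, tab1.col2, tab1.col3, picked.col3, picked.col4
FROM (SELECT col1, col2, col3
      FROM table1
      GROUP BY col1, col2, col3) AS tab1
JOIN (SELECT col1, col2, col3, col4
      FROM (SELECT col1, col2, col3, col4,
                   ROW_NUMBER() OVER (PARTITION BY col1, col2
                                      ORDER BY col3 {direction}, col4 {direction}) AS row_num
            FROM table2) AS ranked
      WHERE row_num = 1) AS picked
  ON tab1.col1 = picked.col1
 AND tab1.col2 = picked.col2
ORDER BY tab1.col1, tab1.col2
"""


def pick_one_row_per_group(direction="ASC"):
    """Return one joined row per (col1, col2), chosen by sort direction."""
    if direction not in ("ASC", "DESC"):
        raise ValueError("direction must be 'ASC' or 'DESC'")
    conn = sqlite3.connect(":memory:")  # assumption: SQLite instead of Redshift
    conn.execute("CREATE TABLE table1 (col1 TEXT, col2 TEXT, col3 TEXT)")
    conn.execute(
        "CREATE TABLE table2 (col1 TEXT, col2 TEXT, col3 INTEGER, col4 INTEGER)"
    )
    conn.executemany("INSERT INTO table1 VALUES (?, ?, ?)", TABLE1_ROWS)
    conn.executemany("INSERT INTO table2 VALUES (?, ?, ?, ?)", TABLE2_ROWS)
    rows = conn.execute(QUERY_TEMPLATE.format(direction=direction)).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(pick_one_row_per_group("ASC"))
    print(pick_one_row_per_group("DESC"))
```

With ASC, the page1/image1 group yields the pair (120, 85), matching the result table in the answer; with DESC it yields (750, 1024). In both cases the pair comes from a single record of table2.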