Search code examples
azure-databricksdatabricks-sqldatabricks-unity-catalog

subquery as a join condition in databricks SQL unity catalog


My requirement is to convert an ORACLE SQL query to databricks SQL unity catalog SQL query. Databricks SQL doesnt support scalar subqueries as a join condition, what I need to convert is :

inner join `uc-dev`.iccu_dna.RtxnStatHist rsh
    ON r.AcctNbr = rsh.AcctNbr
    AND r.RtxnNbr = rsh.RtxnNbr
    AND rsh.TimeUniqueExtn = (SELECT MAX(TIMEUNIQUEEXTN)
                              FROM `uc-dev`.iccu_dna.RtxnStatHist rsh2
                              WHERE AcctNbr = rsh.AcctNbr
                              AND RtxnNbr = rsh.RtxnNbr)

I have tried to use CTE like so:

with 
max_rsh
as
(SELECT MAX(TIMEUNIQUEEXTN) as max_TIMEUNIQUEEXTN
                              FROM `uc-dev`.iccu_dna.RtxnStatHist rsh2
                              WHERE AcctNbr IN (SELECT AcctNbr FROM `uc-dev`.iccu_dna.RtxnStatHist)
                                AND RtxnNbr IN (SELECT RtxnNbr FROM `uc-dev`.iccu_dna.RtxnStatHist)
                                )

and later joining this CTE :

inner join max_rsh mtxn
    ON rsh.TimeUniqueExtn = mtxn.max_TIMEUNIQUEEXTN

Can you help me with this conversion? I need to be able to replicate this join condition in a databricks SQL notebook for unity catalog.


Solution

  • You can try the below approach to databricks SQL:

    SELECT *
    FROM `uc-dev`.iccu_dna.Rtxn r
    INNER JOIN `uc-dev`.iccu_dna.RtxnStatHist rsh
        ON r.AcctNbr = rsh.AcctNbr
        AND r.RtxnNbr = rsh.RtxnNbr
        AND rsh.TimeUniqueExtn = (
            SELECT MAX(TIMEUNIQUEEXTN)
            FROM `uc-dev`.iccu_dna.RtxnStatHist rsh2
            WHERE AcctNbr = rsh.AcctNbr
            AND RtxnNbr = rsh.RtxnNbr
        )
    

    I have tried the below approach as an example:

    INPUT tables:

    enter image description here

    enter image description here

    WITH maxTime AS (
      SELECT
        rsh.AcctNbr,
        rsh.RtxnNbr,
        rsh.TimeUniqueExtn,
        ROW_NUMBER() OVER (PARTITION BY rsh.AcctNbr, rsh.RtxnNbr ORDER BY rsh.TimeUniqueExtn DESC) AS rn
      FROM table1 r
      INNER JOIN RtxnStatHist rsh 
        ON r.AcctNbr = rsh.AcctNbr
        AND r.RtxnNbr = rsh.RtxnNbr
    )
    SELECT
      r.AcctNbr,
      r.RtxnNbr,
      rsh.TimeUniqueExtn
    FROM table1 r
    INNER JOIN maxTime rsh
      ON r.AcctNbr = rsh.AcctNbr
      AND r.RtxnNbr = rsh.RtxnNbr
      AND rsh.rn = 1;
    

    Results:

    
    AcctNbr RtxnNbr TimeUniqueExtn
    1   101 2022-01-01T11:00:00Z
    2   102 2022-01-01T12:00:00Z
    3   103 2022-01-01T14:00:00Z
    
    

    In the above code a Common table expression (CTE) named maxTime is created.

    The CTE calculates the maximum TimeUniqueExtn for each (AcctNbr, RtxnNbr) combination using the ROW_NUMBER() window function.

    Then, the main query joins table1 with the maxTime CTE based on the account number, transaction number, and only selects the rows where the row number is 1 (which represents the maximum TimeUniqueExtn for each combination).