Search code examples
sqldatabricksazure-databricks

Azure Databricks query against struct array column returning all the same values


I have an Azure Databricks table that looks like this:

CREATE TABLE finlog 
(
    AccountNumber       STRING,
    ClientControlNumber STRING,
    LogDetails          ARRAY<STRUCT<DetailId: STRING, ControlId: STRING, Amount: STRING, Label: STRING>>,
    OriginSource        STRING
);

As you can see, the LogDetails is actually an array of structs. Sometimes its null, other times its an array of only 1 struct, and sometimes its an array of dozens of structs.

I would like to write a query against this table. I would like to search for all "Qualified" finlog records. A "Qualified" record is one that:

  1. Has 1+ LogDetails structs in its LogDetails array; and
  2. One of those structs has a ControlId matching the same record's ClientControlNumber field; and
  3. The same struct has a Label value of "ACTIVE".

So for instance, an example of a "Qualified" record would be one whose ClientControlNumber is, say, "12345". It might have 3 structs in its LogDetails array, and one of those structs has a ControlId of "12345" and has a Label of "ACTIVE".

I would like this query to return a single column called LogDetailsAmount. If a record is "Qualified" using the criteria I gave above, then its LogDetailsAmount value would be the LogDetailsAmount.Amount of the LogDetailsAmount struct that matched. Otherwise, if a record is not qualified, its LogDetailsAmount value should be 0.00.

My best attempt thus far is:

SELECT
    CASE
        WHEN EXISTS (SELECT 1
                     FROM finlog AS fl
                     LATERAL VIEW OUTER EXPLODE(fl.LogDetails) AS ld
                     WHERE ld.ControlId = fl.ClientControlNumber
                       AND ld.Label = "ACTIVE")
            THEN (SELECT ld.Amount
                  FROM finlog AS fl
                  LATERAL VIEW OUTER EXPLODE(fl.LogDetails) AS ld
                  WHERE ld.ControlId = fl.ClientControlNumber
                    AND ld.Label = "ACTIVE"
                  LIMIT 1)
            ELSE 0.00
    END AS LogDetailAmount
FROM
    finlog AS fl;

I think I'm close but something is off.

All (and I mean all) values for the LogDetailAmount field are coming back as the same value: -0.95. The query returns millions of results.

I wonder if I'm matching against the same record and somehow just copying-n-pasting the same result over and over again, instead of doing the matching on a row-by-row basis?


Solution

  • I have tried your Query and I see same results as you said:

    %sql
    SELECT
        CASE
            WHEN EXISTS (SELECT 1
                         FROM finlog AS fl
                         LATERAL VIEW OUTER EXPLODE(fl.LogDetails) AS ld
                         WHERE ld.ControlId = fl.ClientControlNumber
                           AND ld.Label = "ACTIVE")
                THEN (SELECT ld.Amount
                      FROM finlog AS fl_inner
                      LATERAL VIEW OUTER EXPLODE(fl_inner.LogDetails) AS ld
                      WHERE ld.ControlId = fl_inner.ClientControlNumber
                        AND ld.Label = "ACTIVE"
                        LIMIT 1)
                ELSE 0.00
        END AS LogDetailAmount
    FROM
        finlog AS fl;
    

    Results:

    LogDetailAmount
    55.00
    55.00
    55.00
    55.00
    

    I tried the below query: In the below query joining on a derived table that includes rows with the 'ACTIVE' label. These rows are then spread out into individual records using the OUTER EXPLODE function. Using left join all rows finlog table will be included in the output, even if there is no match in the derived table.

    %sql
    SELECT 
       CASE 
           WHEN ld_active.ControlId IS NOT NULL AND ld_active.Label = 'ACTIVE' 
           THEN ld_active.Amount 
           ELSE 0.00 
       END AS LogDetailAmount 
    FROM 
       finlog AS main_fl 
       LEFT JOIN 
          (SELECT 
              fl.AccountNumber, 
              ld.Amount, 
              ld.ControlId, 
              ld.Label 
           FROM 
               finlog AS fl 
               LATERAL VIEW OUTER EXPLODE(fl.LogDetails) AS ld
           WHERE 
               ld.Label = 'ACTIVE'
          ) AS ld_active 
          ON main_fl.AccountNumber = ld_active.AccountNumber 
             AND ld_active.ControlId = main_fl.ClientControlNumber;
         
    

    Results:

    LogDetailAmount
    55.00
    45.00
    50.00
    20.00
    0.00
    
    • In the CASE Statement checking the conditions within the WHEN and ELSE clausesstrong text to determine the value of LogDetailAmount. If ld_active.ControlId is not null, and ld_active.Label is 'ACTIVE', then LogDetailAmount is set to ld_active.Amount. If the conditions are not met, it sets LogDetailAmount to 0.00.
    • FROM Clause specifing the source table as finlog with the alias main_fl.
    • Left join with a derived table (ld_active) that contains rows exploded from LogDetails where the label is 'ACTIVE'. The join condition ensures that the rows are matched based on AccountNumber and ControlId.
    • Derived Table (ld_active) using LATERAL VIEW OUTER EXPLODE to explode the LogDetails array into separate rows for each struct. The WHERE clause filters the rows where the label is 'ACTIVE'.
    • ON Clause specifing the join conditions, linking the main_fl table with the ld_active derived table based on AccountNumber and ControlId.