I have an Azure Databricks table that looks like this:
CREATE TABLE finlog
(
AccountNumber STRING,
ClientControlNumber STRING,
LogDetails ARRAY<STRUCT<DetailId: STRING, ControlId: STRING, Amount: STRING, Label: STRING>>,
OriginSource STRING
);
As you can see, the LogDetails
is actually an array of structs. Sometimes its null
, other times its an array of only 1 struct, and sometimes its an array of dozens of structs.
I would like to write a query against this table. I would like to search for all "Qualified" finlog
records. A "Qualified" record is one that:
LogDetails
array; andControlId
matching the same record's ClientControlNumber
field; andLabel
value of "ACTIVE".So for instance, an example of a "Qualified" record would be one whose ClientControlNumber
is, say, "12345". It might have 3 structs in its LogDetails
array, and one of those structs has a ControlId
of "12345" and has a Label
of "ACTIVE".
I would like this query to return a single column called LogDetailsAmount
. If a record is "Qualified" using the criteria I gave above, then its LogDetailsAmount
value would be the LogDetailsAmount.Amount
of the LogDetailsAmount
struct that matched. Otherwise, if a record is not qualified, its LogDetailsAmount
value should be 0.00.
My best attempt thus far is:
SELECT
CASE
WHEN EXISTS (SELECT 1
FROM finlog AS fl
LATERAL VIEW OUTER EXPLODE(fl.LogDetails) AS ld
WHERE ld.ControlId = fl.ClientControlNumber
AND ld.Label = "ACTIVE")
THEN (SELECT ld.Amount
FROM finlog AS fl
LATERAL VIEW OUTER EXPLODE(fl.LogDetails) AS ld
WHERE ld.ControlId = fl.ClientControlNumber
AND ld.Label = "ACTIVE"
LIMIT 1)
ELSE 0.00
END AS LogDetailAmount
FROM
finlog AS fl;
I think I'm close but something is off.
All (and I mean all) values for the LogDetailAmount
field are coming back as the same value: -0.95
. The query returns millions of results.
I wonder if I'm matching against the same record and somehow just copying-n-pasting the same result over and over again, instead of doing the matching on a row-by-row basis?
I have tried your Query and I see same results as you said:
%sql
SELECT
CASE
WHEN EXISTS (SELECT 1
FROM finlog AS fl
LATERAL VIEW OUTER EXPLODE(fl.LogDetails) AS ld
WHERE ld.ControlId = fl.ClientControlNumber
AND ld.Label = "ACTIVE")
THEN (SELECT ld.Amount
FROM finlog AS fl_inner
LATERAL VIEW OUTER EXPLODE(fl_inner.LogDetails) AS ld
WHERE ld.ControlId = fl_inner.ClientControlNumber
AND ld.Label = "ACTIVE"
LIMIT 1)
ELSE 0.00
END AS LogDetailAmount
FROM
finlog AS fl;
Results:
LogDetailAmount
55.00
55.00
55.00
55.00
I tried the below query:
In the below query joining on a derived table that includes rows with the 'ACTIVE' label.
These rows are then spread out into individual records using the OUTER EXPLODE
function.
Using left join
all rows finlog table will be included in the output, even if there is no match in the derived table.
%sql
SELECT
CASE
WHEN ld_active.ControlId IS NOT NULL AND ld_active.Label = 'ACTIVE'
THEN ld_active.Amount
ELSE 0.00
END AS LogDetailAmount
FROM
finlog AS main_fl
LEFT JOIN
(SELECT
fl.AccountNumber,
ld.Amount,
ld.ControlId,
ld.Label
FROM
finlog AS fl
LATERAL VIEW OUTER EXPLODE(fl.LogDetails) AS ld
WHERE
ld.Label = 'ACTIVE'
) AS ld_active
ON main_fl.AccountNumber = ld_active.AccountNumber
AND ld_active.ControlId = main_fl.ClientControlNumber;
Results:
LogDetailAmount
55.00
45.00
50.00
20.00
0.00
CASE
Statement checking the conditions within the WHEN
and
ELSE
clausesstrong text to determine the value of LogDetailAmount. If
ld_active.ControlId is not null
, and ld_active.Label is 'ACTIVE', then LogDetailAmount is set to
ld_active.Amount. If the conditions are not met, it sets LogDetailAmount to 0.00.FROM
Clause specifing the source table as finlog with the alias
main_fl.LATERAL VIEW OUTER EXPLODE
to
explode
the LogDetails array into separate rows for each struct.
The WHERE
clause filters the rows where the label is 'ACTIVE'.ON
Clause specifing the join conditions, linking the main_fl table
with the ld_active derived table based on AccountNumber and
ControlId.