I have a hive table that acts as my source table. I also have one more hive table that acts as target. The DDL of both the source table and target table is same, except that a few journaling columns have been added in the target table. Below are the DDLs: Source:
CREATE EXTERNAL TABLE source.customer_detail(
id string,
name string,
city string,
properties_owned array<struct<property_addr:string, location:string>>
)
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION
'/user/aiman/customer_detail';
Target:
CREATE EXTERNAL TABLE target.customer_detail(
id string,
name string,
city string,
properties_owned array<struct<property_addr:string, location:string>>
audit_insterted_ts timestamp,
audit_dml_action char(1)
)
PARTITIONED BY (audit_active_flag char(1))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS ORC
LOCATION
'/user/aiman/target/customer_detail';
Data at Source:
+---------------------+--------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------+
| customer_detail.id | customer_detail.name | customer_detail.city | customer_detail.properties_owned |
+---------------------+--------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------+
| 1 | Aiman Sarosh | kolkata | [{"property_addr":"H1 Block Saltlake","location":"kolkata"},{"property_addr":"New Property Added Saltlake","location":"kolkata"}] |
| 2 | Justin | delhi | [{"property_addr":"some address in delhi","location":"delhi"}] |
+---------------------+--------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------+
Data at Target:
+---------------------+--------------------------+-------------------------+------------------------------------------------------------------+--------------------------------------+-----------------------------------+------------------------------------+
| customer_detail.id | customer_detail.name | customer_detail.city | customer_detail.properties_owned | customer_detail.audit_insterted_ts | customer_detail.audit_dml_action | customer_detail.audit_active_flag |
+---------------------+--------------------------+-------------------------+------------------------------------------------------------------+--------------------------------------+-----------------------------------+------------------------------------+
| 1 | Aiman Sarosh | kolkata | [{"property_addr":"H1 Block Saltlake","location":"kolkata"}] | 2018-09-04 06:55:12.361 | I | A |
| 2 | Justin | delhi | [{"property_addr":"some address in delhi","location":"delhi"}] | 2018-09-05 08:36:39.023 | I | A |
+---------------------+--------------------------+-------------------------+---------------------------------------------------------------------------------------------------------+-----------------------------------+------------------------------------+
When I run the query below, it should fetch me 1 record that has been modified, i.e.:
+---------------------+--------------------------+-------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-----------------------------------+------------------------------------+
| customer_detail.id | customer_detail.name | customer_detail.city | customer_detail.properties_owned | customer_detail.audit_insterted_ts | customer_detail.audit_dml_action | customer_detail.audit_active_flag |
+---------------------+--------------------------+-------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-----------------------------------+------------------------------------+
| 1 | Aiman Sarosh | kolkata | [{"property_addr":"H1 Block Saltlake","location":"kolkata"},{"property_addr":"New Property Added Saltlake","location":"kolkata"}] | 2018-09-05 07:15:10.321 | U | A |
+---------------------+--------------------------+-------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-----------------------------------+------------------------------------+
Basically, {"property_addr":"New Property Added Saltlake","location":"kolkata"}
element has been added to array column properties_owned
for record ID 1 at the source
.
Query:
SELECT --fetch modified/updated records in source
source.id AS id,
source.name AS name,
source.city AS city,
source.properties_owned AS properties_owned,
current_timestamp() AS audit_insterted_ts,
'U' AS audit_dml_action,
'A' AS audit_active_flag
FROM source.customer_detail source
INNER JOIN target.customer_detail jrnl
ON source.id=jrnl.id
WHERE source.name!=jrnl.name
OR source.city!=jrnl.city
OR source.properties_owned!=jrnl.properties_owned
But it is throwing error:
Error: Error while compiling statement: FAILED: SemanticException [Error 10016]: Line 14:3 Argument type mismatch 'properties_owned': The 1st argument of NOT EQUAL is expected to a primitive type, but list is found (state=42000,code=10016)
How do I compare two columns in WHERE clause having complex datatypes, when I am using JOINS ?
I can use .POS
and .ITEM
but this wont be helpful because my column is an array of structure and the length of array can be different.
I fixed this using LATERAL VIEW explode()
.
And then used concat_ws()
in conjunction with collect_list(array<string>)
method on the exploded columns, which finally gave me a single string
which I compared:
SELECT --fetch modified/updated records in source
source.id AS id,
source.name AS name,
source.city AS city,
source.properties_owned AS properties_owned,
current_timestamp() AS audit_insterted_ts,
'U' AS audit_dml_action,
'A' AS audit_active_flag
FROM source.customer_detail source
INNER JOIN target.customer_detail jrnl
ON source.id=jrnl.id
WHERE source.id IN
(
SELECT t1.id
FROM
(
SELECT src.id,concat_ws(',', collect_list(src.property_addr),collect_list(src.location)) newcol
FROM
(
SELECT id, prop_owned.property_addr AS property_addr, prop_owned.location AS location
FROM source.customer_detail LATERAL VIEW explode(properties_owned) exploded_tab AS prop_owned
) src
GROUP BY src.id
) t1
INNER JOIN
(
SELECT trg.id,concat_ws(',', collect_list(trg.property_addr),collect_list(trg.location)) newcol
FROM
(
SELECT id, prop_owned.property_addr AS property_addr, prop_owned.location AS location
FROM target.customer_detail LATERAL VIEW explode(properties_owned) exploded_tab AS prop_owned
) trg
GROUP BY trg.id
) t2
ON t1.id=t2.id
WHERE t1.newcol!=t2.newcol
Hope someone finds this useful and helpful. :-)