Search code examples
hadoophivehiveqlhadoop2beeline

Hive: How do I compare two columns in WHERE clause having complex datatypes?


I have a hive table that acts as my source table. I also have one more hive table that acts as target. The DDL of both the source table and target table is same, except that a few journaling columns have been added in the target table. Below are the DDLs: Source:

CREATE EXTERNAL TABLE source.customer_detail(
   id string,
   name string,
   city string,
   properties_owned array<struct<property_addr:string, location:string>>
)
ROW FORMAT SERDE
  'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION
  '/user/aiman/customer_detail';

Target:

CREATE EXTERNAL TABLE target.customer_detail(
   id string,
   name string,
   city string,
   properties_owned array<struct<property_addr:string, location:string>>
   audit_insterted_ts timestamp,
   audit_dml_action char(1)
)
PARTITIONED BY (audit_active_flag char(1))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS ORC
LOCATION
  '/user/aiman/target/customer_detail';

Data at Source:

+---------------------+--------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------+
| customer_detail.id  |   customer_detail.name   |  customer_detail.city   |                                               customer_detail.properties_owned                                                       |
+---------------------+--------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------+
| 1                   | Aiman Sarosh             |      kolkata            |  [{"property_addr":"H1 Block Saltlake","location":"kolkata"},{"property_addr":"New Property Added Saltlake","location":"kolkata"}]   |
| 2                   | Justin                   |      delhi              |  [{"property_addr":"some address in delhi","location":"delhi"}]                                                                      |
+---------------------+--------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------+

Data at Target:

+---------------------+--------------------------+-------------------------+------------------------------------------------------------------+--------------------------------------+-----------------------------------+------------------------------------+
| customer_detail.id  |   customer_detail.name   |  customer_detail.city   |              customer_detail.properties_owned                    |  customer_detail.audit_insterted_ts  | customer_detail.audit_dml_action  | customer_detail.audit_active_flag  |
+---------------------+--------------------------+-------------------------+------------------------------------------------------------------+--------------------------------------+-----------------------------------+------------------------------------+
| 1                   | Aiman Sarosh             |      kolkata            |  [{"property_addr":"H1 Block Saltlake","location":"kolkata"}]    | 2018-09-04 06:55:12.361              | I                                 | A                                  |
| 2                   | Justin                   |      delhi              |  [{"property_addr":"some address in delhi","location":"delhi"}]  | 2018-09-05 08:36:39.023              | I                                 | A                                  |
+---------------------+--------------------------+-------------------------+---------------------------------------------------------------------------------------------------------+-----------------------------------+------------------------------------+

When I run the query below, it should fetch me 1 record that has been modified, i.e.:

+---------------------+--------------------------+-------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-----------------------------------+------------------------------------+
| customer_detail.id  |   customer_detail.name   |  customer_detail.city   |                                                                  customer_detail.properties_owned                                              |  customer_detail.audit_insterted_ts  | customer_detail.audit_dml_action  | customer_detail.audit_active_flag  |
+---------------------+--------------------------+-------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-----------------------------------+------------------------------------+
| 1                   | Aiman Sarosh             |      kolkata            |  [{"property_addr":"H1 Block Saltlake","location":"kolkata"},{"property_addr":"New Property Added Saltlake","location":"kolkata"}]             | 2018-09-05 07:15:10.321              | U                                 | A                                  |
+---------------------+--------------------------+-------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-----------------------------------+------------------------------------+

Basically, {"property_addr":"New Property Added Saltlake","location":"kolkata"} element has been added to array column properties_owned for record ID 1 at the source.

Query:

SELECT  --fetch modified/updated records in source
   source.id AS id,
   source.name AS name,
   source.city AS city,
   source.properties_owned AS properties_owned,
   current_timestamp() AS audit_insterted_ts,
   'U' AS audit_dml_action,
   'A' AS audit_active_flag
FROM source.customer_detail source
INNER JOIN target.customer_detail jrnl
ON source.id=jrnl.id
WHERE source.name!=jrnl.name
OR source.city!=jrnl.city
OR source.properties_owned!=jrnl.properties_owned

But it is throwing error:

Error: Error while compiling statement: FAILED: SemanticException [Error 10016]: Line 14:3 Argument type mismatch 'properties_owned': The 1st argument of NOT EQUAL  is expected to a primitive type, but list is found (state=42000,code=10016)

How do I compare two columns in WHERE clause having complex datatypes, when I am using JOINS ?
I can use .POS and .ITEM but this wont be helpful because my column is an array of structure and the length of array can be different.


Solution

  • I fixed this using LATERAL VIEW explode().
    And then used concat_ws() in conjunction with collect_list(array<string>) method on the exploded columns, which finally gave me a single string which I compared:

    SELECT  --fetch modified/updated records in source
       source.id AS id,
       source.name AS name,
       source.city AS city,
       source.properties_owned AS properties_owned,
       current_timestamp() AS audit_insterted_ts,
       'U' AS audit_dml_action,
       'A' AS audit_active_flag
    FROM source.customer_detail source
    INNER JOIN target.customer_detail jrnl
    ON source.id=jrnl.id
    WHERE source.id IN
    (
    SELECT t1.id
    FROM
    (
       SELECT src.id,concat_ws(',', collect_list(src.property_addr),collect_list(src.location)) newcol
       FROM
       (
          SELECT id, prop_owned.property_addr AS property_addr, prop_owned.location AS location
          FROM source.customer_detail LATERAL VIEW explode(properties_owned) exploded_tab AS prop_owned
       ) src
       GROUP BY src.id
    ) t1
    INNER JOIN
    (
       SELECT trg.id,concat_ws(',', collect_list(trg.property_addr),collect_list(trg.location)) newcol
       FROM
       (
          SELECT id, prop_owned.property_addr AS property_addr, prop_owned.location AS location
          FROM target.customer_detail LATERAL VIEW explode(properties_owned) exploded_tab AS prop_owned
       ) trg
       GROUP BY trg.id
    ) t2
    ON t1.id=t2.id
    WHERE t1.newcol!=t2.newcol
    

    Hope someone finds this useful and helpful. :-)