Search code examples
parquetimpala

How to display all the fields in a complex data column (map type) in Impala?


I have a datatable in Impala having a complex column with a structure: key and value. I want to show in each row the entire fields of that column: keys and values for each row, when I SELECT the data.

Desired table:

Id,             map_tag
100, {building:yes, type:apartment, street:street_a, number:3} 
101, {building:yes, type:hotel}
102, {building:yes, type: bank}

If I use:

SELECT id, GROUP_CONCAT(kv_pair) map_tag
FROM (
SELECT dt.id id, CONCAT(cc.key,':',cc.value) kv_pair
FROM datatable dt, datatable.complex_column cc
) T
GROUP BY id

Then I get one only Id, instead of many Ids. Here a screenshot of it: enter image description here


Solution

  • According to Impala Complex Types doc, "references to fields within MAP columns use the KEY and VALUE pseudocolumns". This doc also provides the syntax to query MAP fields. So a combination of a SELECT and GROUP_CONCAT should do it:

    SELECT id, GROUP_CONCAT(kv_pair) map_tag
    FROM (
      SELECT dt.id id, CONCAT(cc.key,':',cc.value) kv_pair
      FROM datatable dt, dt.complex_column cc
    ) T
    GROUP BY id
    
    Query submitted at: 2020-08-24 12:34:17 (Coordinator: https://impalac:25000)
    +-----+---------------------------------------------------------+
    | id  | map_tag                                                 | 
    +-----+---------------------------------------------------------+
    | 102 | building:yes, type:bank                                 |
    | 100 | building:yes, type:apartment, street:street_a, number:3 |
    | 101 | building:yes, type:hotel                                |
    +-----+---------------------------------------------------------+
    Fetched 3 row(s) in 0.42s
    

    Please note that when referencing a map field while doing join, you should use an alias for the "main" table if you did assign it an alias. I.e. in the above example, the inner SELECT uses datatable dt so the reference to complex_column should look like dt.complex_column (not actual_table.column_name which would be "normal").