Search code examples
xmlhadoophivehive-serde

XML Schema to Hive Schema


I am trying to load a xml file into hive table. I am using xml serde here. I am able to load simple flat xml files. But when there are nested elements in the xml, I am using hive complex data types to store them (for e.g., array<struct>). Below is the sample xml that I am trying to load. My goal is to load all elements, attributes and content into hive table.

<description action="up">
    <name action="aorup" ln="te">
    this is name1
    </name>
    <name action="aorup" ln="tm">
    this is name2
    </name>
    <name action="aorup" ln="hi">
    this is name2
    </name>
</description>

Hive output that I am trying to get is...

{action:"up", name:[{action:"aorup", ln:"te", content:"this is name1"}, {action:"aorup", ln:"tm", content:"this is name2"}, {action:"aorup", ln:"hi", content:"this is name3"}]}

I wanted to load this entire xml into a single hive column. I tried the following:

CREATE TABLE description(
description STRUCT<
Action:STRING, 
name:ARRAY<STRUCT<
    Action:STRING, ln:STRING, content:STRING
    >>
>)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"xml.processor.class"="com.ximpleware.hive.serde2.xml.vtd.XmlProcessor",
"column.xpath.description"="/description")
STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES ("xmlinput.start"="<description ","xmlinput.end"= "</description>");

But I am getting null value for Label field. Can someone help me?

Thanks


Solution

  • create external table description
    (
        description struct<action:string,description:array<struct<action:string,ln:string,name:string>>>
    )
    row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
    with serdeproperties 
    (
        "column.xpath.description" = "/description"
    )
    stored as 
    inputformat     'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
    outputformat    'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
    tblproperties   
    (
        "xmlinput.start" = "<description "
       ,"xmlinput.end"   = "</description>"
    )
    ;
    

    select * from description
    ;
    

    +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |                                                                                         description                                                                                         |
    +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | {"action":"up","description":[{"action":"aorup","ln":"te","name":"this is name1"},{"action":"aorup","ln":"tm","name":"this is name2"},{"action":"aorup","ln":"hi","name":"this is name2"}]} |
    +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+