Search code examples
xmlhivenestedhive-serde

Loading nested XML data into Hive using SerDe


I am trying load nested XML data into Hive. Sample data is as follows...

<CustomerOrders>
  <Customers>
    <CustID>ALFKI</CustID>
    <Orders>
      <OrderID>10643</OrderID>
      <CustomerID>ALFKI</CustomerID>
      <OrderDate>1997-08-25</OrderDate>
    </Orders>
    <Orders>
      <OrderID>10692</OrderID>
      <CustomerID>ALFKI</CustomerID>
      <OrderDate>1997-10-03</OrderDate>
    </Orders>
    <CompanyName>Alfreds Futterkiste</CompanyName>
  </Customers>
  <Customers>
    <CustID>ANATR</CustID>
    <Orders>
      <OrderID>10308</OrderID>
      <CustomerID>ANATR</CustomerID>
      <OrderDate>1996-09-18</OrderDate>
    </Orders>
    <CompanyName>Ana Trujillo Emparedados y helados</CompanyName>
  </Customers>
</CustomerOrders>

Below is the command that I am using:

CREATE TABLE CUSTOMERORDERS(
          CustID STRING,
          Orders ARRAY<STRUCT<OrderID:STRING,CustomerID:STRING,OrderDate:STRING>>,
          CompanyName STRING)
          ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
          WITH SERDEPROPERTIES (
          "column.xpath.CustID"="/Customers/CustID/text()",
          "column.xpath.Orders"="/Customers/Orders",
          "column.xpath.OrderID"="/Customers/Orders/OrderID",
          "column.xpath.CustomerID"="/Customers/Orders/CustomerID",
          "column.xpath.OrderDate"="/Customers/Orders/OrderDate",
          "column.xpath.CompanyName"="/Customers/CompanyName/text()")
          STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
          OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
          TBLPROPERTIES ("xmlinput.start"="<Customers>","xmlinput.end"= "</Customers>");

Output that I am gettings is:

    hive> select * from customerorders;
OK
ALFKI   [{"orderid":null,"customerid":null,"orderdate":null},{"orderid":null,"customerid":null,"orderdate":null}]       Alfreds Futterkiste
ANATR   [{"orderid":null,"customerid":null,"orderdate":null}]   Ana Trujillo Emparedados y helados
Time taken: 0.039 seconds, Fetched: 2 row(s)

I'm getting null values for OrderID, CustomerID and OrderDate. Can anyone help me in solving this problem?

Thanks


Solution

  • create external table customerorders
    (
        custid      string
       ,orders      array<struct<Orders:struct<OrderID:string,CustomerID:string,OrderDate:string>>>
       ,companyname string
    )
    row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
    with serdeproperties 
    (
        "column.xpath.CustID"       = "/Customers/CustID/text()"
       ,"column.xpath.Orders"       = "/Customers/Orders"
       ,"column.xpath.CompanyName"  = "/Customers/CompanyName/text()"
    )
    
    stored as 
    inputformat     'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
    outputformat    'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
    tblproperties 
    (
        "xmlinput.start"    = "<Customers>"
       ,"xmlinput.end"      = "</Customers>"
    );
    

    --

    select * from customerorders
    ;
    

    --

    +--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------+
    | custid |                                                                           orders                                                                            |            companyname             |
    +--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------+
    | ALFKI  | [{"orders":{"orderid":"10643","customerid":"ALFKI","orderdate":"1997-08-25"}},{"orders":{"orderid":"10692","customerid":"ALFKI","orderdate":"1997-10-03"}}] | Alfreds Futterkiste                |
    | ANATR  | [{"orders":{"orderid":"10308","customerid":"ANATR","orderdate":"1996-09-18"}}]                                                                              | Ana Trujillo Emparedados y helados |
    +--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------+