How do I set the SerDe XML schema correctly?


I've got this XML:

  <AssetCrossReferences Ordered="false">
    <AssetCrossReference AssetID="F7961393-01" Type="Primary Image"/>
    <AssetCrossReference AssetID="M0504-01" Type="Vendor Logo"/>
    <AssetCrossReference AssetID="F7961393-02" Type="Colour Photograph"/>
  </AssetCrossReferences>
  <Specification Ordered="true">

I want the end result to look like this:

AssetID:F7961393-01, Type:Primary Image
AssetID:M0504-01, Type:Vendor Logo
AssetID:F7961393-02, Type:Colour Photograph

How do I do that?


Solution

  • Use a STRUCT column, so that the AssetID and Type attributes of the matched element become struct fields:

    create external table test
    (
       -- one struct field per attribute of <AssetCrossReference>
       asset STRUCT<AssetID:STRING,Type:STRING>
    )
    ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
    with serdeproperties
    (
      -- XPath that selects the element feeding the asset column
      "column.xpath.asset"="/AssetCrossReferences/AssetCrossReference"
    )
    stored as inputformat "com.ibm.spss.hive.serde2.xml.XmlInputFormat"
    outputformat "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
    -- placeholder: replace with the directory that holds the XML files
    location "file:///yourfilepath"
    tblproperties
    (
      -- each record runs from the start marker to the end marker
      "xmlinput.start"="<AssetCrossReferences",
      "xmlinput.end"="</AssetCrossReferences>"
    );
    

    Then query the table:

    select * from test;
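
    A plain select prints the struct in Hive's own struct notation rather than the "AssetID:..., Type:..." form asked for in the question. As a minimal sketch (not part of the original answer), the fields can be read individually and joined with concat:

    -- format each struct as "AssetID:<id>, Type:<type>"
    select concat('AssetID:', asset.AssetID, ', Type:', asset.Type)
    from test;

    If the struct column returns only one reference per <AssetCrossReferences> block, a possible variant is to declare the column as an array of structs and explode it into one row per element. The names test_array, assets, t and a are hypothetical, and the sketch assumes the SerDe maps repeated elements to ARRAY<STRUCT<...>> as its documentation describes for arrays; it has not been checked against a running cluster.

    -- hypothetical variant: one array entry per <AssetCrossReference>
    create external table test_array
    (
       assets ARRAY<STRUCT<AssetID:STRING,Type:STRING>>
    )
    ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
    with serdeproperties
    (
      "column.xpath.assets"="/AssetCrossReferences/AssetCrossReference"
    )
    stored as inputformat "com.ibm.spss.hive.serde2.xml.XmlInputFormat"
    outputformat "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
    location "file:///yourfilepath"
    tblproperties
    (
      "xmlinput.start"="<AssetCrossReferences",
      "xmlinput.end"="</AssetCrossReferences>"
    );

    -- explode the array so each asset becomes its own output row
    select concat('AssetID:', a.AssetID, ', Type:', a.Type)
    from test_array
    lateral view explode(assets) t as a;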