Search code examples
xmlxpathhiveapache-pig

Parse XML and store in Hive table


I am using XPath to parse XML using pig, but in my use case i have to parse the entire xml file and move it to hive table. I am thinking of using XPath to parse the XML file and than move that parsed data to hive table using pig. But is there any other way to do that ?


Solution

  • We can parse XML file in Hive using hivexmlserde.

    Create an external Hive table using hivexmlserde and put all xml file on that xml location(All xml file should be similar). Using this serde you need define start and end tag in create table statement and XPath of attributes you want to fetch.

    Please see the example below.

    add jar /home/udf_jars/hivexmlserde-1.0.5.3.jar;
    CREATE EXTERNAL TABLE hive_test_xml(
    col1            string,
    col2            string,
    col3            string
    )
    ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
    WITH SERDEPROPERTIES (
    "column.xpath.col1"="/Books/col1/text()",
    "column.xpath.col2"="/Books/col2/text()",
    "column.xpath.col3"="/Books/col3/text()"
    )
    STORED AS
    INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
    LOCATION ' /user/user1/xml/data'
    TBLPROPERTIES (
    "xmlinput.start"="<Books",
    "xmlinput.end"="</Books>");
    

    For more detailed information, please visit Hive-XML-Options link for more about it.

    Update: How to select Data from table.

    You need to add the hivexmlserde-1.0.5.3.jar before query. Please see the example below.

    hive> add jar /path/to/jar/hivexmlserde-1.0.5.3.jar;
    Added [add jar /path/to/jar/hivexmlserde-1.0.5.3.jar] to class path
    Added resources: [add jar /path/to/jar/hivexmlserde-1.0.5.3.jar]
    hive> use mydatabase_name;
    OK
    Time taken: 0.021 seconds
    hive> select * from xm_table;
    

    You can also set this jar permanently to avoid adding all the time before querying. Please follow below link

    Adding/Defining Jars in Hive permanently