Search code examples
sqlxmlxpathhivehiveql

In Hive, how to read through NULL / empty tags present within an XML using explode(XPATH(..)) function?


In below Hive-query, I need to read the null / empty "string" tags as well, from the XML content. Only the non-null "string" tags are getting considered within the XPATH() list now.

with your_data as (
select  '<ParentArray>
    <ParentFieldArray>
        <Name>ABCD</Name>
        <Value>
            <string>111</string>
            <string></string>
            <string>222</string>
        </Value>
    </ParentFieldArray>
    <ParentFieldArray>
        <Name>EFGH</Name>
        <Value>
            <string/>
            <string>444</string>
            <string></string>
            <string>555</string>

        </Value>
    </ParentFieldArray>
</ParentArray>' as xmlinfo
)

select Name, Value 
  from your_data d
       lateral view outer explode(XPATH(xmlinfo, 'ParentArray/ParentFieldArray/Name/text()')) pf as  Name
       lateral view outer explode(XPATH(xmlinfo, concat('ParentArray/ParentFieldArray[Name="', pf.Name, '"]/Value/string/text()'))) vl as Value;

Expected output from query:

Name    Value
ABCD    111
ABCD    
ABCD    222
EFGH    
EFGH    444
EFGH    
EFGH    555

Solution

  • The problem here is that XPATH returns NodeList and if it contains empty node, it is not included in the list.

    Concatenation with some string (in XPATH): concat(/Value/string/text()," ") does not work here:

    Caused by: javax.xml.xpath.XPathExpressionException: com.sun.org.apache.xpath.internal.XPathException: Can not convert #STRING to a NodeList!

    at com.sun.org.apache.xpath.internal.jaxp.XPathExpressionImpl.evaluate(XPathExpressionImpl.java:195)

    Easy solution is to replace <string></string> and <string/> with <string>NULL</string> and then you can convert 'NULL' string to null.

    Demo:

    with your_data as (
    select  '<ParentArray>
        <ParentFieldArray>
            <Name>ABCD</Name>
            <Value>
                <string>111</string>
                <string></string>
                <string>222</string>
            </Value>
        </ParentFieldArray>
        <ParentFieldArray>
            <Name>EFGH</Name>
            <Value>
                <string/>
                <string>444</string>
                <string></string>
                <string>555</string>
            </Value>
        </ParentFieldArray>
    </ParentArray>' as xmlinfo
    )
    
    select name, case when value='NULL' then null else value end value
      from (select regexp_replace(xmlinfo,'<string></string>|<string/>','<string>NULL</string>') xmlinfo 
              from your_data d
           ) d
           lateral view outer explode(XPATH(xmlinfo, 'ParentArray/ParentFieldArray/Name/text()')) pf as  Name
           lateral view outer explode(XPATH(xmlinfo, concat('ParentArray/ParentFieldArray[Name="', pf.Name, '"]/Value/string/text()'))) vl as value
    

    Result:

    name    value
    ABCD    111
    ABCD    
    ABCD    222
    EFGH    
    EFGH    444
    EFGH    
    EFGH    555