xpath pentaho kettle

XPath to fetch single elements as well as all sub-elements


I am stuck in a weird scenario. I need to parse an incoming XML file and shred it into a database. I am using Pentaho Kettle's 'Get XML Data' component. My Loop XPath is: readable/trans/header/*/*/*

The sample data is:

    <readable>
       <trans>
          <header>
             <single>Data1</single>
             <A>
                <A1>DATA</A1>
                <A2>DATA</A2>
             </A>
             <A>
                <A3>DATA</A3>
                <A4>DATA</A4>
             </A>
             <B>
                <B1>DATA</B1>
                <B2>DATA</B2>
                <C>
                   <C1>data</C1>
                   <C2>data</C2>
                </C>
             </B>
          </header>
       </trans>
    </readable>

As can be seen, element C sits at the greatest depth, and it is not present everywhere; it can appear at random inside some elements. Because of that, in order to cover all elements down to the depth of C, my XPath has three levels.

But the problem now is that I am not able to get the values of the single elements.
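To make the symptom concrete, here is a minimal sketch in Python (not Kettle itself) that applies one, two, and three wildcard steps below header to the sample data. It uses the standard library's ElementTree, which supports only a subset of XPath, and the helper name `tags_at_depth` is made up for illustration. The point is that a path with a fixed number of `/*` steps matches elements at exactly that depth, so the three-level path cannot see `single`, `B1`, or `B2`.

```python
import xml.etree.ElementTree as ET

# Sample data from the question, condensed onto fewer lines.
SAMPLE = """
<readable>
  <trans>
    <header>
      <single>Data1</single>
      <A><A1>DATA</A1><A2>DATA</A2></A>
      <A><A3>DATA</A3><A4>DATA</A4></A>
      <B>
        <B1>DATA</B1>
        <B2>DATA</B2>
        <C><C1>data</C1><C2>data</C2></C>
      </B>
    </header>
  </trans>
</readable>
"""

root = ET.fromstring(SAMPLE)  # root is the <readable> element


def tags_at_depth(depth):
    """Hypothetical helper: tags matched by ./trans/header plus `depth` wildcard steps."""
    path = "./trans/header" + "/*" * depth
    return [el.tag for el in root.findall(path)]


for depth in (1, 2, 3):
    print(depth, tags_at_depth(depth))
```

With three steps only the children of C are matched, which agrees with the sample values in the question.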

Name               XPath         Sample value fetched
TAG_value          .             data
TAG_NAME           name(.)       C1
TAG_PARENT_NAME    name(../.)    C
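The three field definitions can be mimicked outside Kettle with a rough Python sketch. It assumes the loop path is three wildcard steps below header and uses the standard library's ElementTree, which keeps no parent pointers, so a child-to-parent lookup is built first. The names `field_rows` and `parent_of` are invented for this example and are not part of Kettle.

```python
import xml.etree.ElementTree as ET

# Sample data from the question, condensed onto fewer lines.
SAMPLE = """
<readable>
  <trans>
    <header>
      <single>Data1</single>
      <A><A1>DATA</A1><A2>DATA</A2></A>
      <A><A3>DATA</A3><A4>DATA</A4></A>
      <B>
        <B1>DATA</B1>
        <B2>DATA</B2>
        <C><C1>data</C1><C2>data</C2></C>
      </B>
    </header>
  </trans>
</readable>
"""

root = ET.fromstring(SAMPLE)

# ElementTree has no parent axis, so map every child element to its parent.
parent_of = {child: parent for parent in root.iter() for child in parent}


def field_rows(loop_path):
    """Hypothetical helper: one row per loop node with the three fields."""
    rows = []
    for node in root.findall(loop_path):
        rows.append({
            "TAG_value": "".join(node.itertext()).strip(),  # like .
            "TAG_NAME": node.tag,                            # like name(.)
            "TAG_PARENT_NAME": parent_of[node].tag,          # like name(../.)
        })
    return rows


for row in field_rows("./trans/header/*/*/*"):
    print(row)
```

Only C1 and C2 produce rows here, because the fields are evaluated once per node matched by the loop path.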

How do I fetch the values of "B1" and "B2", which fall under "B" but above "C"?

Basically, how do I fetch:

<B1>DATA</B1>
<B2>DATA</B2> 

And remember, there should be a single 'Loop XPath', as I mentioned above, with which I can fetch all values, since I need to shred the XML into a database. Thanks in advance, folks.


Solution

  • Your requirements are a bit unclear, so here are a few possible solutions.

    If you know the structure of the entire document and names of those elements beforehand:

    /readable/trans/header/B/*[self::B1 or self::B2]
    

    If you do not know the structure of the document, but know the names of the target elements:

    //*[self::B1 or self::B2]
    

    If you know the structure of the document but not the names of the target elements, and you know that they must be immediate children of a B element and must not be the C element:

    /readable/trans/header/B/*[not(self::C)]
    

    All of these expressions return the same result for the sample document (individual results separated by -------):

    <B1>DATA</B1>
    -----------------------
    <B2>DATA</B2>
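
As a quick sanity check, the three expressions can be approximated in Python. This is only a sketch: the standard library's ElementTree does not support the `self::` axis or `or` inside predicates, so each predicate is applied as a plain Python filter after a simpler path lookup. The function names are hypothetical, and a full XPath 1.0 engine would evaluate the original expressions directly.

```python
import xml.etree.ElementTree as ET

# Sample data from the question, condensed onto fewer lines.
SAMPLE = """
<readable>
  <trans>
    <header>
      <single>Data1</single>
      <A><A1>DATA</A1><A2>DATA</A2></A>
      <A><A3>DATA</A3><A4>DATA</A4></A>
      <B>
        <B1>DATA</B1>
        <B2>DATA</B2>
        <C><C1>data</C1><C2>data</C2></C>
      </B>
    </header>
  </trans>
</readable>
"""

root = ET.fromstring(SAMPLE)


def serialized(elements):
    """Render each element as markup, dropping surrounding whitespace."""
    return [ET.tostring(el, encoding="unicode").strip() for el in elements]


def known_structure_and_names():
    # Approximates /readable/trans/header/B/*[self::B1 or self::B2]
    children = root.findall("./trans/header/B/*")
    return serialized(el for el in children if el.tag in ("B1", "B2"))


def known_names_only():
    # Approximates //*[self::B1 or self::B2]
    return serialized(el for el in root.iter() if el.tag in ("B1", "B2"))


def known_structure_only():
    # Approximates /readable/trans/header/B/*[not(self::C)]
    children = root.findall("./trans/header/B/*")
    return serialized(el for el in children if el.tag != "C")


for result in (known_structure_and_names(), known_names_only(), known_structure_only()):
    print(result)
```

The results agree only for this sample. On other documents the second variant would also pick up B1 or B2 elements outside header/B, and the third would pick up any other non-C child of B.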