How to parse a table's content and structure from XML into a Word document with Apache POI?

I am trying to parse a table that is stored in an XML file as HTML markup and generate a Word document from it. The table structure and content should be generated automatically in the Word document. To process the XML with Java, I am using the Apache POI library. When I retrieve the values from the XML, I don't see the HTML tags that define the table structure. However, without those tags I cannot create a corresponding table in the Word document. How should I proceed in that case?

The XML that I am parsing has one field with values that are arranged in a table structure.

<customfield id="9999" key="com.atlassian.jira.plugin.system.customfieldtypes:textarea">
  <customfieldname>Product</customfieldname>
       <customfieldvalues>
          <customfieldvalue>
    &lt;div class=&apos;table-wrap&apos;&gt;
    &lt;table class=&apos;conTable&apos;&gt;&lt;tbody&gt;
    &lt;tr&gt;
    &lt;td class=&apos;confluenceTd&apos;&gt;&lt;ul&gt;
        &lt;li&gt;Product1:&lt;/li&gt;
    &lt;/ul&gt;
    &lt;/td&gt;
    &lt;td class=&apos;confluenceTd&apos;&gt;&lt;ul&gt;
        &lt;li&gt;Product2:&lt;/li&gt;
    &lt;/ul&gt;
    &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
    &lt;td class=&apos;confluenceTd&apos;&gt;&lt;ul&gt;
        &lt;li&gt;Product3;/li&gt;
        &lt;li&gt;Product4&lt;/li&gt;
    &lt;/ul&gt;
    &lt;/td&gt;
    &lt;td class=&apos;confluenceTd&apos;&gt;&lt;ul&gt;
        &lt;li&gt;Product5&lt;/li&gt;
        &lt;li&gt;Product6&lt;/li&gt;
    &lt;/ul&gt;
    &lt;/td&gt;
    &lt;/tr&gt;
    &lt;/tbody&gt;&lt;/table&gt;
    &lt;/div&gt;
         </customfieldvalue>
     </customfieldvalues>
  </customfield>

The corresponding HTML is as follows:

> <customfieldvalues>
>     <customfieldvalue> <div class='table-wrap'> <table class='confluenceTable'><tbody> <tr> <td class='confluenceTd'><ul>
> <li>Product1:</li> </ul> </td> <td class='confluenceTd'><ul>
> <li>Product2:</li> </ul> </td> </tr> <tr> <td
> class='confluenceTd'><ul> <li>Product3</li> <li>Product4</li> </ul>
> </td> <td class='confluenceTd'><ul> <li>Product5</li>
> <li>Product6</li> </ul> </td> </tr> </tbody></table> </div>    
> </customfieldvalue> </customfieldvalues>

I have parsed the XML in the usual way to retrieve its value:

element.item(n).getChildNodes().item(0).getNodeValue()

Solution

Here is a basic demo using Jsoup.

It assumes you have already extracted the text content from your <customfieldvalue>...</customfieldvalue> element.
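That extraction step might be sketched as follows, using only the JDK's built-in DOM parser. The class and method names (FieldValueExtractor, firstCustomFieldValue) are hypothetical and chosen for illustration; only the element name customfieldvalue comes from the question's XML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
class FieldValueExtractor {

    // Returns the trimmed text of the first <customfieldvalue> element,
    // or null if the document contains no such element.
    static String firstCustomFieldValue(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList values = doc.getElementsByTagName("customfieldvalue");
            if (values.getLength() == 0) {
                return null;
            }
            // getTextContent() collects all text below the element, so it does
            // not depend on the text being the element's first child node.
            return values.item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

Note that an XML parser decodes entities such as &lt; and &apos; while reading, so a string obtained this way should already contain literal markup like <div class='table-wrap'>. For data like the sample, the unescape step would then have nothing left to decode; it matters only if the value reaches you still escaped.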

So, now you have a string containing:

&lt;div class=&apos;table-wrap&apos;&gt; ... &lt;/div&gt;

To extract that content as an HTML document using Jsoup:

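import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;   // not org.w3c.dom.Element
import org.jsoup.parser.Parser;
boolean strictMode = true;
String unescapedString = Parser.unescapeEntities(escapedString, strictMode);
Element element = Jsoup.parse(unescapedString).body();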

You can iterate through all the descendant elements of this containing element:

// "element" is the body element obtained above; a different loop variable
// name avoids redeclaring it.
for (Element descendant : element.children().select("*")) {
    System.out.println(descendant.nodeName() + " - " + descendant.ownText());
}

In this case, all I am doing is printing each element's name together with any text it directly contains.

The output is:

div - 
table - 
tbody - 
tr - 
td - 
ul - 
li - Product1:
td - 
ul - 
li - Product2:
tr - 
td - 
ul - 
li - Product3;/li>
li - Product4
td - 
ul - 
li - Product5
li - Product6

Interestingly, you can see that there is some malformed escaped HTML in the original data:

&lt;li&gt;Product3;/li&gt;

Once you have full access to the data-as-HTML, you can build your Word table using POI in the usual way.
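As a rough sketch of that last step, assuming both Jsoup and Apache POI (poi-ooxml) are on the classpath: the code below reads the rows and cells of the first table with Jsoup and copies them into an XWPFTable. The names HtmlTableToWord, readTable, toWord and save are hypothetical, and joining the list items of a cell with ", " is a simplification chosen for illustration. Nested tables are not handled.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFTable;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical helper, not part of Jsoup or POI.
class HtmlTableToWord {

    // Reads the first <table> in the (already unescaped) HTML into rows of cell texts.
    static List<List<String>> readTable(String html) {
        List<List<String>> rows = new ArrayList<>();
        Element table = Jsoup.parse(html).selectFirst("table");
        if (table == null) {
            return rows;
        }
        for (Element tr : table.select("tr")) {
            List<String> cells = new ArrayList<>();
            for (Element td : tr.select("td, th")) {
                Elements items = td.select("li");
                // Cells holding a list: join the items; otherwise use the cell text.
                cells.add(items.isEmpty()
                        ? td.text()
                        : String.join(", ", items.eachText()));
            }
            rows.add(cells);
        }
        return rows;
    }

    // Builds a Word document containing one table with the given cell texts.
    static XWPFDocument toWord(List<List<String>> rows) {
        XWPFDocument doc = new XWPFDocument();
        if (rows.isEmpty()) {
            return doc;
        }
        int cols = 1;
        for (List<String> row : rows) {
            cols = Math.max(cols, row.size());
        }
        XWPFTable table = doc.createTable(rows.size(), cols);
        for (int r = 0; r < rows.size(); r++) {
            List<String> row = rows.get(r);
            for (int c = 0; c < row.size(); c++) {
                table.getRow(r).getCell(c).setText(row.get(c));
            }
        }
        return doc;
    }

    // Writes the document to the given path, e.g. "products.docx".
    static void save(XWPFDocument doc, String path) throws IOException {
        try (FileOutputStream out = new FileOutputStream(path)) {
            doc.write(out);
        }
    }
}
```

With the unescaped string from earlier, a call such as HtmlTableToWord.save(HtmlTableToWord.toWord(HtmlTableToWord.readTable(unescapedString)), "products.docx") would produce the file. If the bullet lists should survive as real Word lists rather than joined text, each cell would need its own paragraphs and numbering, which this sketch does not attempt.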