I am trying to parse a table that is defined by HTML tags inside an XML file and generate a Word document from it. The table structure and its content should be generated automatically in the Word document. I am working in Java with the help of the Apache POI library. When I retrieve the values from the XML, I don't see the HTML tags that define the table structure. However, without those tags I cannot create the corresponding table in the Word document. How should I proceed in that case?
The XML that I am parsing has one field with values that are arranged in a table structure.
<customfield id="9999" key="com.atlassian.jira.plugin.system.customfieldtypes:textarea">
<customfieldname>Product</customfieldname>
<customfieldvalues>
<customfieldvalue>
<div class='table-wrap'>
<table class='conTable'><tbody>
<tr>
<td class='confluenceTd'><ul>
<li>Product1:</li>
</ul>
</td>
<td class='confluenceTd'><ul>
<li>Product2:</li>
</ul>
</td>
</tr>
<tr>
<td class='confluenceTd'><ul>
<li>Product3;/li>
<li>Product4</li>
</ul>
</td>
<td class='confluenceTd'><ul>
<li>Product5</li>
<li>Product6</li>
</ul>
</td>
</tr>
</tbody></table>
</div>
</customfieldvalue>
</customfieldvalues>
</customfield>
The corresponding HTML is as follows:
> <customfieldvalues>
> <customfieldvalue> <div class='table-wrap'> <table class='confluenceTable'><tbody> <tr> <td class='confluenceTd'><ul>
> <li>Product1:</li> </ul> </td> <td class='confluenceTd'><ul>
> <li>Product2:</li> </ul> </td> </tr> <tr> <td
> class='confluenceTd'><ul> <li>Product3</li> <li>Product4</li> </ul>
> </td> <td class='confluenceTd'><ul> <li>Product5</li>
> <li>Product6</li> </ul> </td> </tr> </tbody></table> </div>
> </customfieldvalue> </customfieldvalues>
I have parsed the XML in the usual way to retrieve its value:
element.item(n).getChildNodes().item(0).getNodeValue()
Here is a basic demo using Jsoup.
It assumes you have already extracted the text content from your <customfieldvalue>...</customfieldvalue>
element.
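If that extraction step is still open, here is a minimal sketch using only the JDK's DOM parser. It assumes the HTML is stored entity-escaped inside the element, so the XML parser sees it as plain text. The class and method names are made up for illustration. The point is that getTextContent() returns all of the element's text, whereas getNodeValue() on the first child returns only the first text node.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class CustomFieldReader {

    /** Returns the full text of the first customfieldvalue element, or null if there is none. */
    static String firstCustomFieldValue(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList values = doc.getElementsByTagName("customfieldvalue");
            if (values.getLength() == 0) {
                return null;
            }
            // Concatenates every descendant text node of the element.
            return values.item(0).getTextContent();
        } catch (Exception e) {
            // Wrapped so callers do not have to handle the checked parser exceptions.
            throw new IllegalStateException("Could not parse the XML", e);
        }
    }
}
```

The returned string can then be handed to Jsoup as described next.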
So, now you have a string containing:
<div class='table-wrap'> ... </div>
To extract that content as an HTML document using Jsoup:
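import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

// escapedString holds the text taken from <customfieldvalue>
boolean strictMode = true;
String unescapedString = Parser.unescapeEntities(escapedString, strictMode);
Element element = Jsoup.parse(unescapedString).body();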
You can iterate through all the child elements of this containing element:
for (Element child : Jsoup.parse(unescapedString).body().children().select("*")) {
    System.out.println(child.nodeName() + " - " + child.ownText());
}
In this case, all I am doing is printing each element with any data it contains.
The output is:
div -
table -
tbody -
tr -
td -
ul -
li - Product1:
td -
ul -
li - Product2:
tr -
td -
ul -
li - Product3;/li>
li - Product4
td -
ul -
li - Product5
li - Product6
Interestingly, you can see that there is some malformed escaped HTML in the original data:
<li>Product3;/li>
Once you have full access to the data-as-HTML, you can build your Word table using POI in the usual way.
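As a rough sketch of that last step, the following maps each tr to a Word table row and each td to a cell, writing every li as its own paragraph. It assumes Jsoup and POI's XWPF classes (poi-ooxml) are on the classpath, that the HTML contains a single table with no nested tables, and that there are no merged cells. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFTable;
import org.apache.poi.xwpf.usermodel.XWPFTableCell;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical helper, not part of Jsoup or POI.
final class HtmlTableToWord {

    /** Builds a Word document holding one table that mirrors the HTML rows and cells. */
    static XWPFDocument toWordDocument(String html) {
        Elements rows = Jsoup.parse(html).select("tr");
        int columnCount = 0;
        for (Element row : rows) {
            columnCount = Math.max(columnCount, row.select("td").size());
        }
        XWPFDocument document = new XWPFDocument();
        if (rows.isEmpty() || columnCount == 0) {
            return document; // nothing to render
        }
        XWPFTable table = document.createTable(rows.size(), columnCount);
        for (int r = 0; r < rows.size(); r++) {
            Elements cells = rows.get(r).select("td");
            for (int c = 0; c < cells.size(); c++) {
                fillCell(table.getRow(r).getCell(c), cells.get(c));
            }
        }
        return document;
    }

    /** Writes one paragraph per li; falls back to the cell's plain text if there is no list. */
    private static void fillCell(XWPFTableCell wordCell, Element htmlCell) {
        List<String> lines = new ArrayList<>();
        for (Element item : htmlCell.select("li")) {
            lines.add(item.text());
        }
        if (lines.isEmpty()) {
            lines.add(htmlCell.text());
        }
        for (int i = 0; i < lines.size(); i++) {
            // A freshly created cell already holds one empty paragraph; reuse it first.
            XWPFParagraph paragraph = (i == 0) ? wordCell.getParagraphs().get(0) : wordCell.addParagraph();
            paragraph.createRun().setText(lines.get(i));
        }
    }
}
```

To save the result, pass an output stream to the document's write method, for example a FileOutputStream on a file such as products.docx (the file name is only an example), and close the document afterwards.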