Tags: java, html, xml, ms-word, apache-poi

How to parse a table's content and structure from XML into Word with Apache POI?


I am trying to parse a table that is stored in an XML file as HTML markup and generate a Word document from it. The table structure and content should be generated automatically in the Word document. I am working in Java and using the Apache POI library. When I retrieve the values from the XML, I don't see the HTML tags that define the table structure. However, without those tags I cannot create a corresponding table in the Word document. How should I proceed in that case?

The XML that I am parsing has one field with values that are arranged in a table structure.

<customfield id="9999" key="com.atlassian.jira.plugin.system.customfieldtypes:textarea">
  <customfieldname>Product</customfieldname>
       <customfieldvalues>
          <customfieldvalue>
    &lt;div class=&apos;table-wrap&apos;&gt;
    &lt;table class=&apos;conTable&apos;&gt;&lt;tbody&gt;
    &lt;tr&gt;
    &lt;td class=&apos;confluenceTd&apos;&gt;&lt;ul&gt;
        &lt;li&gt;Product1:&lt;/li&gt;
    &lt;/ul&gt;
    &lt;/td&gt;
    &lt;td class=&apos;confluenceTd&apos;&gt;&lt;ul&gt;
        &lt;li&gt;Product2:&lt;/li&gt;
    &lt;/ul&gt;
    &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
    &lt;td class=&apos;confluenceTd&apos;&gt;&lt;ul&gt;
        &lt;li&gt;Product3;/li&gt;
        &lt;li&gt;Product4&lt;/li&gt;
    &lt;/ul&gt;
    &lt;/td&gt;
    &lt;td class=&apos;confluenceTd&apos;&gt;&lt;ul&gt;
        &lt;li&gt;Product5&lt;/li&gt;
        &lt;li&gt;Product6&lt;/li&gt;
    &lt;/ul&gt;
    &lt;/td&gt;
    &lt;/tr&gt;
    &lt;/tbody&gt;&lt;/table&gt;
    &lt;/div&gt;
         </customfieldvalue>
     </customfieldvalues>
  </customfield>

The corresponding HTML is as follows:

> <customfieldvalues>
>     <customfieldvalue> <div class='table-wrap'> <table class='confluenceTable'><tbody> <tr> <td class='confluenceTd'><ul>
> <li>Product1:</li> </ul> </td> <td class='confluenceTd'><ul>
> <li>Product2:</li> </ul> </td> </tr> <tr> <td
> class='confluenceTd'><ul> <li>Product3</li> <li>Product4</li> </ul>
> </td> <td class='confluenceTd'><ul> <li>Product5</li>
> <li>Product6</li> </ul> </td> </tr> </tbody></table> </div>    
> </customfieldvalue> </customfieldvalues>

I have parsed the XML in the usual way to retrieve its value:

element.item(n).getChildNodes().item(0).getNodeValue()

Solution

  • Here is a basic demo using Jsoup.

    It assumes you have already extracted the text content from your <customfieldvalue>...</customfieldvalue> element.

    So, now you have a string containing:

    &lt;div class=&apos;table-wrap&apos;&gt; ... &lt;/div&gt;
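
    How that string is obtained is outside the scope of this demo. As a minimal sketch, assuming the JDK's built-in DOM parser is used, the extraction could look like the following. The class name `CustomFieldExtractor` and the method `extractFirstValue` are hypothetical names introduced for illustration. Note that an XML parser resolves `&lt;`, `&gt;` and `&apos;` while parsing, so the string returned by this sketch already contains literal tags rather than the escaped form.

    ```java
    import java.io.StringReader;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import org.xml.sax.InputSource;

    // Hypothetical helper, not part of the original question or answer.
    public class CustomFieldExtractor {

        // Returns the text of the first <customfieldvalue> element, or null if there is none.
        public static String extractFirstValue(String xml) {
            try {
                DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
                Document doc = builder.parse(new InputSource(new StringReader(xml)));
                NodeList values = doc.getElementsByTagName("customfieldvalue");
                if (values.getLength() == 0) {
                    return null;
                }
                // getTextContent() joins all text nodes of the element; the parser
                // has already turned the predefined entities into literal characters.
                return values.item(0).getTextContent().trim();
            } catch (Exception e) {
                throw new IllegalArgumentException("Could not parse XML", e);
            }
        }

        public static void main(String[] args) {
            String xml = "<customfieldvalues><customfieldvalue>"
                    + "&lt;table&gt;&lt;tr&gt;&lt;td class=&apos;confluenceTd&apos;&gt;Product1"
                    + "&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;"
                    + "</customfieldvalue></customfieldvalues>";
            // prints: <table><tr><td class='confluenceTd'>Product1</td></tr></table>
            System.out.println(extractFirstValue(xml));
        }
    }
    ```

    Using `getTextContent()` rather than `getChildNodes().item(0).getNodeValue()` avoids depending on the value arriving as a single text node.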
    

    To extract that content as an HTML document using Jsoup:

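    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Element;
    import org.jsoup.parser.Parser;
    boolean strictMode = true; // passed as Jsoup's inAttribute flag
    String unescapedString = Parser.unescapeEntities(escapedString, strictMode);
    Element body = Jsoup.parse(unescapedString).body();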
    

    You can iterate through all the child elements of this containing element:

    for (Element element : Jsoup.parse(unescapedString).body().children().select("*")) {
        System.out.println(element.nodeName() + " - " + element.ownText());
    }
    

    In this case, all I am doing is printing each element with any data it contains.

    The output is:

    div - 
    table - 
    tbody - 
    tr - 
    td - 
    ul - 
    li - Product1:
    td - 
    ul - 
    li - Product2:
    tr - 
    td - 
    ul - 
    li - Product3;/li>
    li - Product4
    td - 
    ul - 
    li - Product5
    li - Product6
    

    Interestingly, you can see that there is some malformed escaped HTML in the original data:

    &lt;li&gt;Product3;/li&gt;
    

    Once you have full access to the data-as-HTML, you can build your Word table using POI in the usual way.
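
    As a rough illustration of that last step, here is a minimal sketch that reads the rows, cells and list items with Jsoup and writes them into a Word table with POI's XWPF classes. It assumes jsoup and poi-ooxml are on the classpath, that the HTML contains a single, non-nested table, and that each list item should become its own paragraph in its cell. The class name `HtmlTableToWord`, its method names and the output file name `products.docx` are made up for this example.

    ```java
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;
    import org.apache.poi.xwpf.usermodel.XWPFParagraph;
    import org.apache.poi.xwpf.usermodel.XWPFTable;
    import org.apache.poi.xwpf.usermodel.XWPFTableCell;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Element;

    // Hypothetical helper class; assumes one non-nested HTML table.
    public class HtmlTableToWord {

        // Reads the HTML into rows -> cells -> list items.
        public static List<List<List<String>>> readTable(String html) {
            List<List<List<String>>> rows = new ArrayList<>();
            for (Element tr : Jsoup.parse(html).select("tr")) {
                List<List<String>> cells = new ArrayList<>();
                for (Element td : tr.select("td")) {
                    List<String> items = new ArrayList<>();
                    for (Element li : td.select("li")) {
                        items.add(li.text());
                    }
                    // Fall back to the cell's plain text when it has no list items.
                    if (items.isEmpty() && td.hasText()) {
                        items.add(td.text());
                    }
                    cells.add(items);
                }
                rows.add(cells);
            }
            return rows;
        }

        // Builds a document with one table; each item becomes a paragraph in its cell.
        public static XWPFDocument buildDocument(List<List<List<String>>> rows) {
            XWPFDocument doc = new XWPFDocument();
            int cols = 1;
            for (List<List<String>> row : rows) {
                cols = Math.max(cols, row.size());
            }
            XWPFTable table = doc.createTable(Math.max(rows.size(), 1), cols);
            for (int r = 0; r < rows.size(); r++) {
                for (int c = 0; c < rows.get(r).size(); c++) {
                    XWPFTableCell cell = table.getRow(r).getCell(c);
                    List<String> items = rows.get(r).get(c);
                    for (int i = 0; i < items.size(); i++) {
                        // Reuse the cell's initial paragraph for the first item.
                        XWPFParagraph p = (i == 0 && !cell.getParagraphs().isEmpty())
                                ? cell.getParagraphs().get(0)
                                : cell.addParagraph();
                        p.createRun().setText(items.get(i));
                    }
                }
            }
            return doc;
        }

        public static void main(String[] args) throws IOException {
            String html = "<table><tbody>"
                    + "<tr><td><ul><li>Product1:</li></ul></td><td><ul><li>Product2:</li></ul></td></tr>"
                    + "<tr><td><ul><li>Product3</li><li>Product4</li></ul></td>"
                    + "<td><ul><li>Product5</li><li>Product6</li></ul></td></tr>"
                    + "</tbody></table>";
            try (XWPFDocument doc = buildDocument(readTable(html));
                 FileOutputStream out = new FileOutputStream("products.docx")) {
                doc.write(out);
            }
        }
    }
    ```

    Keeping the Jsoup step and the POI step in separate methods means the intermediate lists can be inspected or cleaned up (for example, to deal with the malformed `Product3` entry) before anything is written to the document.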