Unable to XPath parse the bullet point type in Word document using OpenXML


The following MS Word document contains only two bullet points, each belonging to a separate list and each enclosed in a one-cell table.

Screenshot of Input Document

How do I use the Word document's underlying document.xml, numbering.xml, and styles.xml, to capture the type of bullet point (i.e., circle or square)? Reading the http://officeopenxml.com docs and other SO posts, I attempted the following to no avail:

  1. With document.xml, retrieve $num_id = w:numPr/w:numId/@w:val and $lvl_id = w:numPr/w:ilvl/@w:val values.

  2. With numbering.xml, use the $num_id value above to retrieve $abs_id = w:num[@w:numId = $num_id]/w:abstractNumId/@w:val, then return the corresponding bullet value: w:abstractNum[@w:abstractNumId = $abs_id]/w:lvl[@w:ilvl = $lvl_id]/w:lvlText/@w:val

    However, this result is not correct, as both lists return what looks like a square bullet.

  3. With styles.xml, review the ListParagraph w:style for any additional matching criteria.

    However, no unique identifiers or values appear useful. What am I missing?
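The lookup chain in steps 1 and 2 can be sketched outside XSLT as follows. This is a minimal Python illustration using only the standard library, not the code from this post: the `SAMPLE_NUMBERING` string is a hypothetical, trimmed-down stand-in for numbering.xml, the helper names `w` and `resolve_bullet` are made up for the example, and the two bullet characters are written as the character references `&#xF081;` and `&#xF0A3;` on the assumption that those are the code points involved. It prints code points instead of the raw characters, so two values that render identically can still be told apart.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def w(name):
    """Return the Clark-notation name for a w:-prefixed attribute."""
    return "{%s}%s" % (W, name)


# Hypothetical, trimmed-down stand-in for numbering.xml (assumed code points).
SAMPLE_NUMBERING = """\
<w:numbering xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:abstractNum w:abstractNumId="0">
    <w:lvl w:ilvl="0">
      <w:numFmt w:val="bullet"/>
      <w:lvlText w:val="&#xF081;"/>
      <w:rPr><w:rFonts w:ascii="Wingdings 2" w:hAnsi="Wingdings 2"/></w:rPr>
    </w:lvl>
  </w:abstractNum>
  <w:abstractNum w:abstractNumId="8">
    <w:lvl w:ilvl="0">
      <w:numFmt w:val="bullet"/>
      <w:lvlText w:val="&#xF0A3;"/>
      <w:rPr><w:rFonts w:ascii="Wingdings 2" w:hAnsi="Wingdings 2"/></w:rPr>
    </w:lvl>
  </w:abstractNum>
  <w:num w:numId="5"><w:abstractNumId w:val="0"/></w:num>
  <w:num w:numId="9"><w:abstractNumId w:val="8"/></w:num>
</w:numbering>
"""


def resolve_bullet(numbering_root, num_id, ilvl):
    """Follow numId -> abstractNumId -> lvl; return (font, [code points]) or None."""
    abs_id = None
    for num in numbering_root.findall("w:num", NS):
        if num.get(w("numId")) == num_id:
            abs_id = num.find("w:abstractNumId", NS).get(w("val"))
            break
    if abs_id is None:
        return None
    for abstract in numbering_root.findall("w:abstractNum", NS):
        if abstract.get(w("abstractNumId")) != abs_id:
            continue
        for lvl in abstract.findall("w:lvl", NS):
            if lvl.get(w("ilvl")) != ilvl:
                continue
            lvl_text = lvl.find("w:lvlText", NS)
            text = lvl_text.get(w("val"), "") if lvl_text is not None else ""
            fonts = lvl.find("w:rPr/w:rFonts", NS)
            font = fonts.get(w("ascii")) if fonts is not None else None
            # Report code points, not glyphs: private-use characters all look alike.
            return font, ["U+%04X" % ord(ch) for ch in text]
    return None


root = ET.fromstring(SAMPLE_NUMBERING)
for num_id in ("5", "9"):
    print(num_id, resolve_bullet(root, num_id, "0"))
```

The sketch deliberately ignores w:lvlOverride inside w:num and style-linked numbering definitions, so it only covers the simple case shown in this question.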


See the relevant sections of the XML documents below. Please advise if other sections or documents are relevant.

document.xml

            <w:p w14:paraId="16A4A39D"
                 w14:textId="10E79F44"
                 w:rsidR="00DB3D99"
                 w:rsidRPr="00D6457F"
                 w:rsidRDefault="00DB3D99"
                 w:rsidP="007205D3">
               <w:pPr>
                  <w:pStyle w:val="ListParagraph"/>
                  <w:keepNext/>
                  <w:numPr>
                     <w:ilvl w:val="0"/>
                     <w:numId w:val="5"/>
                  </w:numPr>
                  <w:spacing w:before="80" w:after="80"/>
                  <w:contextualSpacing w:val="0"/>
                  <w:rPr>
                     <w:rFonts w:ascii="Franklin Gothic Book" w:hAnsi="Franklin Gothic Book"/>
                     <w:bCs/>
                     <w:sz w:val="20"/>
                     <w:szCs w:val="20"/>
                  </w:rPr>
               </w:pPr>
               <w:r w:rsidRPr="00DB3D99">
                  <w:rPr>
                     <w:rFonts w:ascii="Franklin Gothic Book" w:hAnsi="Franklin Gothic Book"/>
                     <w:bCs/>
                     <w:sz w:val="20"/>
                     <w:szCs w:val="20"/>
                  </w:rPr>
                  <w:t>Mainstreaming environmental considerations into social and economic decisions at all levels is of vital importance</w:t>
               </w:r>
            </w:p>

 ...
            <w:p w14:paraId="79FEF50C"
                 w14:textId="65464CBE"
                 w:rsidR="009C1A5F"
                 w:rsidRPr="009C1A5F"
                 w:rsidRDefault="009C1A5F"
                 w:rsidP="009C1A5F">
               <w:pPr>
                  <w:pStyle w:val="ListParagraph"/>
                  <w:keepNext/>
                  <w:numPr>
                     <w:ilvl w:val="0"/>
                     <w:numId w:val="9"/>
                  </w:numPr>
                  <w:spacing w:before="80" w:after="80"/>
                  <w:rPr>
                     <w:rFonts w:ascii="Franklin Gothic Book" w:hAnsi="Franklin Gothic Book"/>
                     <w:sz w:val="20"/>
                     <w:szCs w:val="20"/>
                  </w:rPr>
               </w:pPr>
               <w:r w:rsidRPr="009C1A5F">
                  <w:rPr>
                     <w:rFonts w:ascii="Franklin Gothic Book" w:hAnsi="Franklin Gothic Book"/>
                     <w:bCs/>
                     <w:sz w:val="20"/>
                     <w:szCs w:val="20"/>
                  </w:rPr>
                  <w:t>Solutions need to seek an integrated approach that simultaneously address the conservation of the planet’s genetic diversity, species and ecosystems</w:t>
               </w:r>
            </w:p>

numbering.xml

<w:abstractNum w:abstractNumId="0" w15:restartNumberingAfterBreak="0">
      <w:nsid w:val="037970D6"/>
      <w:multiLevelType w:val="hybridMultilevel"/>
      <w:tmpl w:val="98A2E35C"/>
      <w:lvl w:ilvl="0" w:tplc="E7067EF0">
         <w:start w:val="1"/>
         <w:numFmt w:val="bullet"/>
         <w:lvlText w:val=""/>
         <w:lvlJc w:val="left"/>
         <w:pPr>
            <w:ind w:left="360" w:hanging="360"/>
         </w:pPr>
         <w:rPr>
            <w:rFonts w:ascii="Wingdings 2" w:hAnsi="Wingdings 2" w:hint="default"/>
         </w:rPr>
      </w:lvl>
   ...
   </w:abstractNum>
...
   <w:abstractNum w:abstractNumId="8" w15:restartNumberingAfterBreak="0">
      <w:nsid w:val="6DA523B5"/>
      <w:multiLevelType w:val="hybridMultilevel"/>
      <w:tmpl w:val="D0A2943E"/>
      <w:lvl w:ilvl="0" w:tplc="CBCE2CF0">
         <w:start w:val="1"/>
         <w:numFmt w:val="bullet"/>
         <w:lvlText w:val=""/>
         <w:lvlJc w:val="left"/>
         <w:pPr>
            <w:ind w:left="360" w:hanging="360"/>
         </w:pPr>
         <w:rPr>
            <w:rFonts w:ascii="Wingdings 2" w:hAnsi="Wingdings 2" w:hint="default"/>
         </w:rPr>
      </w:lvl>
   ...
   </w:abstractNum>
...
   <w:num w:numId="5" w16cid:durableId="963343858">
      <w:abstractNumId w:val="0"/>
   </w:num>
   ...
   <w:num w:numId="9" w16cid:durableId="324748400">
      <w:abstractNumId w:val="8"/>
   </w:num>

styles.xml

<w:style w:type="paragraph" w:styleId="ListParagraph">
  <w:name w:val="List Paragraph"/>
  <w:basedOn w:val="Normal"/>
  <w:link w:val="ListParagraphChar"/>
  <w:uiPriority w:val="34"/>
  <w:qFormat/>
  <w:rsid w:val="007205D3"/>
  <w:pPr>
     <w:ind w:left="720"/>
     <w:contextualSpacing/>
  </w:pPr>
</w:style>

To show my actual XPath implementation: I am running an XSLT stylesheet, invoked from PowerShell, that transforms document.xml (pulling in numbering.xml through document()) and outputs the text and symbol of every bullet point.

style.xsl

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                              xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
 <xsl:output encoding="UTF-8" omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
    <data>
        <xsl:apply-templates select="descendant::w:tbl"/>
    </data>
 </xsl:template>

 <xsl:template match="w:tbl">
    <xsl:apply-templates select="descendant::w:p[descendant::w:t != '']"/>
 </xsl:template>

 <xsl:template match="w:p">
    <xsl:variable name="num_id" select="w:pPr/w:numPr/w:numId/@w:val"/>
    <xsl:variable name="lvl_id" select="w:pPr/w:numPr/w:ilvl/@w:val"/>
    <xsl:variable name="abs_id" select="document('numbering.xml')/w:numbering/
                                         w:num[@w:numId = $num_id]/w:abstractNumId/@w:val" />
    <xsl:variable name="num_val" select="document('numbering.xml')/w:numbering/
                                         w:abstractNum[@w:abstractNumId = $abs_id]/
                                         w:lvl[@w:ilvl = $lvl_id]/w:lvlText/@w:val"/>
    <xsl:variable name="square_bullet"><![CDATA[&#61569;]]></xsl:variable>
    <xsl:variable name="circle_bullet"><![CDATA[&#61603;]]></xsl:variable>
    <row>
        <text>
            <xsl:value-of select="."/>
        </text>
        <symbol>
            <xsl:value-of select="$num_val"/>
        </symbol>
        <type>
            <xsl:choose>
                <xsl:when test="$num_val = $square_bullet">
                    <xsl:text>Checkbox</xsl:text>
                </xsl:when>
                <xsl:when test="$num_val = $circle_bullet">
                    <xsl:text>Radio</xsl:text>
                </xsl:when>
                <xsl:otherwise>Text</xsl:otherwise>
            </xsl:choose>
        </type>
    </row>
 </xsl:template>

 <xsl:template match="text()">
    <xsl:value-of select='normalize-space()'/>
 </xsl:template>
 
</xsl:stylesheet>

transform.ps1

$xslSettings = New-Object System.Xml.Xsl.XsltSettings($true, $false);
$xmlResolver = New-Object System.Xml.XmlUrlResolver;

$xslt = New-Object System.Xml.Xsl.XslCompiledTransform;

$xslt.Load("style.xsl", $xslSettings, $xmlResolver);
$xslt.Transform("document.xml", "output.xml");

output.xml

<data xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <row>
    <text>Mainstreaming environmental considerations into social and economic decisions at all levels is of vital importance</text>
    <symbol></symbol>
    <type>Text</type>
  </row>
  <row>
    <text>Solutions need to seek an integrated approach that simultaneously address the conservation of the planet’s genetic diversity, species and ecosystems</text>
    <symbol></symbol>
    <type>Text</type>
  </row>
</data>
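A possible explanation for both rows ending up as Text, offered as an observation about XML parsing rather than a confirmed diagnosis: character references are not expanded inside a CDATA section, so $square_bullet and $circle_bullet in style.xsl would hold the literal eight-character strings `&#61569;` and `&#61603;` instead of the single characters U+F081 and U+F0A3. The Python sketch below (Python is used only for illustration and is not part of the original setup; the `<v>` element and the `code_points` helper are made up for the example) parses the same reference both ways with the standard library.

```python
import xml.etree.ElementTree as ET

# The same character reference, once wrapped in CDATA (as in style.xsl) and once bare.
cdata_form = ET.fromstring("<v><![CDATA[&#61569;]]></v>").text
plain_form = ET.fromstring("<v>&#61569;</v>").text


def code_points(s):
    """List the code points of a string so invisible characters become readable."""
    return ["U+%04X" % ord(ch) for ch in s]


# CDATA keeps the reference as literal text; the bare form is expanded by the parser.
print(len(cdata_form), repr(cdata_form))
print(len(plain_form), code_points(plain_form))

# A string comparison between the two can therefore never succeed.
print(cdata_form == plain_form)
```

If that is what is happening here, dropping the CDATA wrapper, i.e. `<xsl:variable name="square_bullet">&#61569;</xsl:variable>`, should make the variable hold the single character so that the xsl:when tests have a chance to match.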

Solution

  • In your example, the two w:lvlText values look identical when viewed as XML, but they are not.

    The first one <w:lvlText w:val=""/> holds U+F081

    The second one <w:lvlText w:val=""/> holds U+F0A3

    If I use both characters with <w:rFonts w:ascii="Wingdings 2" w:hAnsi="Wingdings 2" w:hint="default"/> in the corresponding w:abstractNum/w:lvl/w:rPr, I get the same result as you do.

    So, to conclude: in the Wingdings 2 font, the characters U+F081 and U+F0A3 map to an open circle and an open square.

    So your XPath strategy for reaching these characters is correct.

    EDIT: These special characters may show up in the XML as some kind of rectangular shape, but that is just how editors render characters they cannot display.
    In, e.g., BBEdit (on macOS) you can view the bytes of a text file as hex codes, which lets you see the private-use Unicode code points. See, e.g., this question for more on how Windows handles private-use characters.

    I don't know whether it is possible to display the actual bullets inside XML using XSLT, since each one is a combination of a font and a Unicode code point. I suppose you would need to style the output, e.g. with CSS, to show it correctly.
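For readers without a hex-capable editor, the same inspection can be done in a few lines of Python. This is a hedged sketch, not part of the original answer: `describe`, `classify`, and the `BULLET_TYPES` table are hypothetical names, and the circle/square labels simply follow the order in which the two code points are listed above, so they should be checked against the font before being relied on.

```python
import unicodedata


def describe(ch):
    """Show a character the way a hex view would: code point, UTF-8 bytes, category."""
    return {
        "code_point": "U+%04X" % ord(ch),
        "utf8_hex": " ".join("%02x" % b for b in ch.encode("utf-8")),
        # Unicode general category "Co" means a private-use code point.
        "private_use": unicodedata.category(ch) == "Co",
    }


# Hypothetical lookup: the glyph depends on BOTH the font and the code point.
BULLET_TYPES = {
    ("Wingdings 2", 0xF081): "circle",  # assumed label, verify against the font
    ("Wingdings 2", 0xF0A3): "square",  # assumed label, verify against the font
}


def classify(font, ch):
    """Return the assumed bullet type for a (font, character) pair, else 'unknown'."""
    return BULLET_TYPES.get((font, ord(ch)), "unknown")


for ch in ("\uf081", "\uf0a3"):
    print(describe(ch), classify("Wingdings 2", ch))
```

Keying the table on the font as well as the code point reflects the point made above: a private-use code point has no meaning of its own, and the same value in a different symbol font may be a completely different glyph.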