
How to parse style-separated paragraphs of MS Word in Aspose or Apache POI?


An MS Word document can contain multi-styled paragraphs. Normally each paragraph has one style, but you can combine text in two or more styles in one paragraph with the Style Separator tool. How can I get the child styles and text contents of style-separated paragraphs from the root paragraph using Aspose.Words, Apache POI, or another library?


Solution

  • A style separator is actually a normal paragraph break with special formatting set on the paragraph mark, so you can treat content separated by a style separator as two separate paragraphs. For example, in the document XML:

    <w:p w14:paraId="561A87F3" w14:textId="0D47DD82" w:rsidR="00AB32A0" w:rsidRPr="00AB32A0" w:rsidRDefault="00AB32A0" w:rsidP="00AB32A0">
      <w:pPr>
        <w:pStyle w:val="Heading1" />
        <w:rPr>
          <w:vanish />
          <w:specVanish />
        </w:rPr>
      </w:pPr>
      <w:r w:rsidRPr="00AB32A0">
        <w:rPr>
          <w:rStyle w:val="Heading1Char" />
        </w:rPr>
        <w:t>Test heading1</w:t>
      </w:r>
    </w:p>
    <w:p w14:paraId="0982566B" w14:textId="76E92742" w:rsidR="00391656" w:rsidRDefault="00AB32A0" w:rsidP="00AB32A0">
      <w:r>
        <w:t xml:space="preserve"> test paragraph.</w:t>
      </w:r>
    </w:p>
    

    The following two elements in the paragraph mark's run properties indicate that the paragraph break is a style separator:

    <w:rPr>
      <w:vanish />
      <w:specVanish />
    </w:rPr>
    

    In Aspose.Words you can detect whether a paragraph break is a style separator with the Paragraph.BreakIsStyleSeparator property.

    C#:

    Document doc = new Document(@"C:\Temp\test.docx");
    foreach (Paragraph para in doc.FirstSection.Body.Paragraphs)
    {
        Console.WriteLine("Style Name: {0}; Is Style Separator: {1}; Content: {2}", para.ParagraphFormat.StyleName, para.BreakIsStyleSeparator, para.ToString(SaveFormat.Text));
    }
    

    Java:

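    Document doc = new Document("C:/Temp/test.docx");
    for (Paragraph para : doc.getFirstSection().getBody().getParagraphs()) {
        String styleName = para.getParagraphFormat().getStyleName();
        // true when this paragraph's break is a style separator
        boolean isStyleSeparator = para.getBreakIsStyleSeparator();
        String content = para.toString(SaveFormat.TEXT);
        System.out.printf("Style Name: %s; Is Style Separator: %b; Content: %s%n",
                styleName, isStyleSeparator, content);
    }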
    

    Disclosure: I work on the Aspose.Words team.
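
For readers without Aspose.Words, the vanish/specVanish check described above can also be applied directly to the document XML. The following is a minimal sketch using only the JDK's built-in XML parser, not an Aspose or Apache POI API. The class name `StyleSeparatorSketch`, the `Piece` type, and the `SAMPLE` string are hypothetical names introduced for illustration. It assumes you have already extracted the XML (in a real .docx it lives in `word/document.xml` inside the zip), and it treats the mere presence of both elements as a style separator without checking for an explicit `w:val="false"`.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class StyleSeparatorSketch {
    // WordprocessingML main namespace (the "w:" prefix).
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical sample mirroring the XML in the answer, reduced to the relevant parts.
    static final String SAMPLE =
        "<w:body xmlns:w=\"" + W + "\">"
        + "<w:p><w:pPr><w:pStyle w:val=\"Heading1\"/>"
        + "<w:rPr><w:vanish/><w:specVanish/></w:rPr></w:pPr>"
        + "<w:r><w:t>Test heading1</w:t></w:r></w:p>"
        + "<w:p><w:r><w:t xml:space=\"preserve\"> test paragraph.</w:t></w:r></w:p>"
        + "</w:body>";

    /** One w:p element: its explicit style id ("" if none), its text, and the separator flag. */
    static final class Piece {
        final String styleId;
        final String text;
        final boolean endsWithStyleSeparator;

        Piece(String styleId, String text, boolean endsWithStyleSeparator) {
            this.styleId = styleId;
            this.text = text;
            this.endsWithStyleSeparator = endsWithStyleSeparator;
        }
    }

    /** Parses a WordprocessingML fragment and returns one Piece per w:p element. */
    static List<Piece> parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the *NS lookups below
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList paragraphs = doc.getElementsByTagNameNS(W, "p");
            List<Piece> pieces = new ArrayList<>();
            for (int i = 0; i < paragraphs.getLength(); i++) {
                Element p = (Element) paragraphs.item(i);
                Element pPr = child(p, "pPr");
                Element pStyle = pPr == null ? null : child(pPr, "pStyle");
                // Run properties of the paragraph mark itself (w:pPr/w:rPr).
                Element markRPr = pPr == null ? null : child(pPr, "rPr");
                boolean separator = markRPr != null
                        && child(markRPr, "vanish") != null
                        && child(markRPr, "specVanish") != null;
                String styleId = pStyle == null ? "" : pStyle.getAttributeNS(W, "val");
                // Concatenate the text of all w:t elements in this paragraph.
                StringBuilder text = new StringBuilder();
                NodeList texts = p.getElementsByTagNameNS(W, "t");
                for (int j = 0; j < texts.getLength(); j++) {
                    text.append(texts.item(j).getTextContent());
                }
                pieces.add(new Piece(styleId, text.toString(), separator));
            }
            return pieces;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document XML", e);
        }
    }

    /** Returns the first direct child element in the w: namespace with the given local name, or null. */
    static Element child(Element parent, String localName) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element
                    && W.equals(n.getNamespaceURI())
                    && localName.equals(n.getLocalName())) {
                return (Element) n;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        for (Piece piece : parse(SAMPLE)) {
            System.out.printf("style=%s; styleSeparator=%b; text=%s%n",
                    piece.styleId, piece.endsWithStyleSeparator, piece.text);
        }
    }
}
```

A piece whose `endsWithStyleSeparator` flag is true sits on the same visual line as the piece that follows it, so consecutive pieces can be grouped on that flag to rebuild what Word displays as a single paragraph. An empty `styleId` means the paragraph has no explicit `w:pStyle` and uses the document's default paragraph style.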