Tags: java, apache-poi, xwpf

Removing an XWPFParagraph keeps the paragraph symbol (¶) for it


I am trying to remove a set of contiguous paragraphs from a Microsoft Word document, using Apache POI.

As far as I understand, a paragraph can be deleted by removing all of its runs, like this:

/*
 * Deletes the given paragraph.
 */
public static void deleteParagraph(XWPFParagraph p) {
    if (p != null) {
        List<XWPFRun> runs = p.getRuns();
        //Delete all the runs
        for (int i = runs.size() - 1; i >= 0; i--) {
            p.removeRun(i);
        }
        p.setPageBreak(false); // Remove any page break
    }
}

It does work, but with a strange result. The block of removed paragraphs does not disappear from the document; it is converted into a set of empty lines, as if every paragraph had been turned into a line break.

When I print the paragraphs' content from code, I do see a space (one for each removed paragraph). Looking at the document itself with formatting marks displayed, I can see this:

[Screenshot: the document with formatting marks displayed]

The vertical column of ¶ corresponds to the block of deleted elements.

Any idea why this happens? I'd like my paragraphs to be completely removed.

I also tried replacing the text (with setText()) and removing any spacing that might be added automatically, like this:

p.setSpacingAfter(0);
p.setSpacingAfterLines(0);
p.setSpacingBefore(0);
p.setSpacingBeforeLines(0);
p.setIndentFromLeft(0);
p.setIndentFromRight(0);
p.setIndentationFirstLine(0);
p.setIndentationLeft(0);
p.setIndentationRight(0);

But with no luck.


Solution

  • I would delete paragraphs by removing the paragraphs themselves, not just the runs inside them. Removing only the runs leaves an empty paragraph behind, which is the ¶ you see. XWPFParagraph itself has no delete method, but using XWPFDocument.getDocument().getBody() we can get the low-level CTBody, which has a removeP(int i).

    Example:

    import java.io.*;
    import org.apache.poi.xwpf.usermodel.*;
    
    import java.awt.Desktop;
    
    import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
    
    public class WordRemoveParagraph {
    
     /*
      * Deletes the given paragraph.
      */
    
     public static void deleteParagraph(XWPFParagraph p) {
      XWPFDocument doc = p.getDocument();
      int pPos = doc.getPosOfParagraph(p);
      //doc.getDocument().getBody().removeP(pPos);
      doc.removeBodyElement(pPos);
     }
    
     public static void main(String[] args) throws IOException, InvalidFormatException {
    
      XWPFDocument doc = new XWPFDocument(new FileInputStream("source.docx"));
    
      int pNumber = doc.getParagraphs().size() -1;
      while (pNumber >= 0) {
       XWPFParagraph p = doc.getParagraphs().get(pNumber);
       if (p.getParagraphText().contains("delete")) {
        deleteParagraph(p);
       }
       pNumber--;
      }
    
      FileOutputStream out = new FileOutputStream("result.docx");
      doc.write(out);
      out.close();
      doc.close();
    
      System.out.println("Done");
      Desktop.getDesktop().open(new File("result.docx"));
    
     }
    
    }
    

    This deletes all paragraphs from the document source.docx where the text contains "delete" and saves the result in result.docx.
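
The loop in main walks from the last paragraph to the first. A minimal stand-alone sketch of why that direction matters, using plain strings in place of paragraphs (the class BackwardRemoval and the method removeMatching are hypothetical names for illustration, and no POI classes are involved):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BackwardRemoval {

    // Removes every entry containing the marker, walking from the last index
    // to the first so that a removal never shifts an index still to be visited.
    static List<String> removeMatching(List<String> items, String marker) {
        List<String> result = new ArrayList<>(items); // work on a mutable copy
        for (int i = result.size() - 1; i >= 0; i--) {
            if (result.get(i).contains(marker)) {
                result.remove(i); // remove(int index), not remove(Object)
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Two adjacent matches: a forward loop that also increments i after
        // a removal would skip "delete 2".
        List<String> paragraphs = Arrays.asList("keep 1", "delete 1", "delete 2", "keep 2");
        System.out.println(removeMatching(paragraphs, "delete")); // prints [keep 1, keep 2]
    }
}
```

Going backwards means each removal only shifts elements that have already been checked, which is what makes a contiguous block of matching paragraphs safe to delete in one pass.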


    Edited:

    Although doc.getDocument().getBody().removeP(pPos); works, it does not update the XWPFDocument's paragraphs list. That breaks paragraph iterators and any other access to that list, since the list is only rebuilt when the document is read again.

    So the better approach is to use doc.removeBodyElement(pPos); instead. removeBodyElement(int pos) does the same as doc.getDocument().getBody().removeP(pos); if pos points to a paragraph in the document body, since a paragraph is a body element (IBodyElement) too. In addition, it updates the XWPFDocument's paragraphs list.
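
The difference between the two calls can be pictured with a toy model. This is only a sketch under stated assumptions, not POI's actual implementation: ToyDocument, xml, cached, removeLowLevel and removeElement are all hypothetical names, where xml stands in for the low-level body and cached for the paragraphs list built when the file is read.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StaleListDemo {

    // Toy stand-in for a document (hypothetical, not a POI class).
    static class ToyDocument {
        final List<String> xml = new ArrayList<>();    // plays the low-level body
        final List<String> cached = new ArrayList<>(); // plays the high-level paragraphs list

        ToyDocument(List<String> paragraphs) {
            xml.addAll(paragraphs);
            cached.addAll(paragraphs);
        }

        // Analogous to removing through the low-level body only:
        // the cached list is not touched and goes stale.
        void removeLowLevel(int pos) {
            xml.remove(pos);
        }

        // Analogous to a high-level removal: both structures change together.
        boolean removeElement(int pos) {
            if (pos < 0 || pos >= cached.size()) {
                return false;
            }
            xml.remove(pos);
            cached.remove(pos);
            return true;
        }
    }

    public static void main(String[] args) {
        ToyDocument a = new ToyDocument(Arrays.asList("p0", "p1", "p2"));
        a.removeLowLevel(1);
        System.out.println(a.xml + " vs " + a.cached); // the two lists now disagree

        ToyDocument b = new ToyDocument(Arrays.asList("p0", "p1", "p2"));
        b.removeElement(1);
        System.out.println(b.xml + " vs " + b.cached); // the two lists stay in sync
    }
}
```

In the toy model, any code that keeps reading cached after removeLowLevel sees a paragraph that no longer exists in xml, which mirrors the iterator problem described above.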