java api apache-poi document docx

Apache POI character runs for .docx


For .doc files, there is a method to get each character run in a paragraph:

 CharacterRun charrun = paragraph.getCharacterRun(k++);

and then I can use those character runs to inspect their attributes like

if (charrun.isBold()) System.out.print(charrun.text());

or something like that. But .docx files seem to have no character run method that can read each word like that. I tried to use

XWPFParagraph item = paragraph.get(i);
List<XWPFRun> charrun = item.getRuns();

I found that a character run in XWPF doesn't hold a single character; each run holds a string of seemingly random length from the document:

XWPFRun temp = charrun.get(0);
System.out.println(temp.getText(0));

This code doesn't return the first character in the paragraph.

So how can I fix this?


Solution

  • Assuming you want to iterate over all the (main) paragraphs in a Word document (excluding tables, headers and the like), then over the character runs in each paragraph, then over the text of each run one character at a time, you'd want to do something like:

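    import org.apache.poi.openxml4j.opc.OPCPackage;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;
    import org.apache.poi.xwpf.usermodel.XWPFParagraph;
    import org.apache.poi.xwpf.usermodel.XWPFRun;

    XWPFDocument doc = new XWPFDocument(OPCPackage.open("myfile.docx"));
    for (XWPFParagraph paragraph : doc.getParagraphs()) {
        int pos = 0;  // character offset within the current paragraph
        for (XWPFRun run : paragraph.getRuns()) {
            // text() returns the run's text with tabs, breaks etc. as characters
            for (char c : run.text().toCharArray()) {
                System.out.println("The character at " + pos + " is " + c);
                pos++;
            }
        }
    }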
    

    That will iterate over each character, and will have things like tabs and newlines represented as their character equivalents (things like w:tab will be converted).

    For HWPF, getting the paragraphs and getting the runs from a paragraph works similarly but not identically, so there's no common interface for those steps. XWPFRun and HWPF's CharacterRun do share a common interface, though (org.apache.poi.wp.usermodel.CharacterRun), so that part of the code can be reused.

    Note that all text in a given character run will share the same style / formatting information. Because of the strange ways that Word works, two adjacent runs may also share the same styles without Word having merged them...
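
If unmerged neighbouring runs get in the way (for example, when printing each bold stretch of text as one piece), one option is to merge them yourself before inspecting them. The following is only a minimal sketch and not POI code: `StyledRun`, `sameStyleAs` and `mergeAdjacent` are hypothetical names, and `StyledRun` is a plain stand-in that holds just the text and the bold/italic flags you would read from a real run (e.g. via `text()`, `isBold()` and `isItalic()`). Real runs carry many more properties, so a real comparison would probably need to check more than these two flags.

```java
import java.util.ArrayList;
import java.util.List;

class RunMerger {

    /** Hypothetical stand-in for a run: its text plus the formatting we compare. */
    public static final class StyledRun {
        public final String text;
        public final boolean bold;
        public final boolean italic;

        public StyledRun(String text, boolean bold, boolean italic) {
            this.text = text;
            this.bold = bold;
            this.italic = italic;
        }

        /** Two runs count as "same style" here if both flags match (an assumption). */
        boolean sameStyleAs(StyledRun other) {
            return bold == other.bold && italic == other.italic;
        }
    }

    /** Joins neighbouring runs whose formatting is identical; order is preserved. */
    public static List<StyledRun> mergeAdjacent(List<StyledRun> runs) {
        List<StyledRun> merged = new ArrayList<>();
        for (StyledRun run : runs) {
            int last = merged.size() - 1;
            if (last >= 0 && merged.get(last).sameStyleAs(run)) {
                // Same formatting as the previous run: append the text to it.
                StyledRun prev = merged.get(last);
                merged.set(last, new StyledRun(prev.text + run.text, prev.bold, prev.italic));
            } else {
                merged.add(run);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // "Hel" and "lo " mimic two runs that Word split despite identical styling.
        List<StyledRun> runs = List.of(
                new StyledRun("Hel", true, false),
                new StyledRun("lo ", true, false),
                new StyledRun("world", false, false));
        for (StyledRun run : mergeAdjacent(runs)) {
            System.out.println((run.bold ? "bold: " : "plain: ") + "[" + run.text + "]");
        }
    }
}
```

With POI you would build the `StyledRun` list inside the loop over `paragraph.getRuns()` and then work on the merged list instead of the raw runs.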