Extract word document comments and the text they comment on


I need to extract Word document comments and the text they comment on. Below is my current solution, but it is not working as expected.

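import com.aspose.words.Comment;
import com.aspose.words.Document;
import com.aspose.words.NodeCollection;
import com.aspose.words.Paragraph;
import com.aspose.words.Run;

import java.util.ArrayList;
import java.util.List;

import static com.aspose.words.NodeType.COMMENT;
import static com.aspose.words.NodeType.PARAGRAPH;

public class Main {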

    public static void main(String[] args) throws Exception {
        var document = new Document("sample.docx");
        NodeCollection<Paragraph> paragraphs = document.getChildNodes(PARAGRAPH, true);
        List<MyComment> myComments = new ArrayList<>();

        for (Paragraph paragraph : paragraphs) {
            var comments = getComments(paragraph);
            int commentIndex = 0;

            if (comments.isEmpty()) continue;

            for (Run run : paragraph.getRuns()) {
                var runText = run.getText();

                for (int i = commentIndex; i < comments.size(); i++) {
                    Comment comment = comments.get(i);
                    String commentText = comment.getText();

                    if (paragraph.getText().contains(runText + commentText)) {
                        myComments.add(new MyComment(runText, commentText));
                        commentIndex++;
                        break;
                    }
                }
            }
        }

        myComments.forEach(System.out::println);
    }

    private static List<Comment> getComments(Paragraph paragraph) {
        @SuppressWarnings("unchecked")
        NodeCollection<Comment> comments = paragraph.getChildNodes(COMMENT, false);
        List<Comment> commentList = new ArrayList<>();

        comments.forEach(commentList::add);

        return commentList;
    }

    static class MyComment {
        String text;
        String commentText;

        public MyComment(String text, String commentText) {
            this.text = text;
            this.commentText = commentText;
        }

        @Override
        public String toString() {
            return text + "-->" + commentText;
        }
    }
}

The contents of sample.docx are: [image]

And the output is (which is incorrect):

factors-->This is word comment
%–10% of cancers are caused by inherited genetic defects from a person's parents.-->Second paragraph comment

Expected output is:

factors-->This is word comment
These factors act, at least partly, by changing the genes of a cell. Typically, many genetic changes are required before cancer develops. Approximately 5%–10% of cancers are caused by inherited genetic defects from a person's parents.-->Second paragraph comment
These factors act, at least partly, by changing the genes of a cell. Typically, many genetic changes are required before cancer develops. Approximately 5%–10% of cancers are caused by inherited genetic defects from a person's parents.-->First paragraph comment

Please help me find a better way of extracting Word document comments and the text they comment on. If you need additional details, let me know and I will provide them.


Solution

  • The commented text is marked by the special nodes CommentRangeStart and CommentRangeEnd. Each of these nodes has an Id that matches the Id of the Comment the range is linked to, so you need to extract the content between the corresponding start and end nodes. By the way, the code example in the Aspose.Words API reference shows how to print the contents of all comments and their comment ranges using a document visitor. It looks like exactly what you are looking for.

    EDIT: You can use code like the following to accomplish your task. I did not provide the full code for extracting content between nodes; it is available on GitHub.

    Document doc = new Document("C:\\Temp\\in.docx");
    
    // Get the comments in the document.
    Iterable<Comment> comments = doc.getChildNodes(NodeType.COMMENT, true);
    Iterable<CommentRangeStart> commentRangeStarts = doc.getChildNodes(NodeType.COMMENT_RANGE_START, true);
    Iterable<CommentRangeEnd> commentRangeEnds = doc.getChildNodes(NodeType.COMMENT_RANGE_END, true);
    
    for (Comment c : comments)
    {
        System.out.println(String.format("Comment %d : %s", c.getId(), c.toString(SaveFormat.TEXT)));
    
        CommentRangeStart start = null;
        CommentRangeEnd end = null;
    
        // Search for an appropriate start and end.
        for (CommentRangeStart s : commentRangeStarts)
        {
            if (c.getId() == s.getId())
            {
                start = s;
                break;
            }
        }
    
        for (CommentRangeEnd e : commentRangeEnds)
        {
            if (c.getId() == e.getId())
            {
                end = e;
                break;
            }
        }
    
        if (start != null && end != null)
        {
            // Extract content between the start and end nodes.
            // An example of how to extract content between nodes is here:
            // https://github.com/aspose-words/Aspose.Words-for-Java/blob/master/Examples/src/main/java/com/aspose/words/examples/programming_documents/document/ExtractContentBetweenCommentRange.java
        }
        else
        {
            System.out.println(String.format("Comment %d does not have a comment range", c.getId()));
        }
    
    }
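
To make the id-matching idea concrete, here is a minimal, library-free sketch of pairing range markers with comments and collecting the text between them. It does not use Aspose.Words at all: the `Node` class, the `Kind` enum, and `extractCommentedText` are hypothetical stand-ins for a document flattened into document order, so treat it as an illustration of the technique rather than the library's API. With a real document you would walk the actual nodes between the matching CommentRangeStart and CommentRangeEnd instead.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of comment ranges; none of these types are Aspose.Words classes.
class CommentRangeSketch {

    enum Kind { TEXT, RANGE_START, RANGE_END }

    // One node in document order: a piece of text or a range marker carrying a comment id.
    static final class Node {
        final Kind kind;
        final int id;      // comment id for markers; -1 for text nodes
        final String text; // run text for TEXT nodes; empty for markers

        private Node(Kind kind, int id, String text) {
            this.kind = kind;
            this.id = id;
            this.text = text;
        }

        static Node text(String text) { return new Node(Kind.TEXT, -1, text); }
        static Node start(int id) { return new Node(Kind.RANGE_START, id, ""); }
        static Node end(int id) { return new Node(Kind.RANGE_END, id, ""); }
    }

    // Maps each comment id to the text found between its start and end markers.
    // Ranges may nest or overlap, so every currently open range receives each text node.
    static Map<Integer, String> extractCommentedText(List<Node> nodes) {
        Map<Integer, StringBuilder> open = new LinkedHashMap<>();
        Map<Integer, String> result = new LinkedHashMap<>();
        for (Node node : nodes) {
            if (node.kind == Kind.RANGE_START) {
                open.put(node.id, new StringBuilder());
            } else if (node.kind == Kind.RANGE_END) {
                StringBuilder collected = open.remove(node.id);
                if (collected != null) {
                    result.put(node.id, collected.toString());
                }
            } else {
                for (StringBuilder sb : open.values()) {
                    sb.append(node.text);
                }
            }
        }
        return result; // ranges that were never closed are left out
    }

    public static void main(String[] args) {
        // Comments 1 and 2 cover the whole sentence; comment 0 covers a single word inside it.
        List<Node> nodes = Arrays.asList(
                Node.start(1), Node.start(2),
                Node.text("These "),
                Node.start(0), Node.text("factors"), Node.end(0),
                Node.text(" act."),
                Node.end(2), Node.end(1));
        extractCommentedText(nodes)
                .forEach((id, text) -> System.out.println(text + "-->comment " + id));
    }
}
```

Feeding each text node to every open range is what lets a word-level comment and a paragraph-level comment over the same text both get their full commented text, which is the case the question's run-by-run matching misses.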