How to clean a JTextPane's/JEditorPane's HTML content into a string in Java?


I am trying to get clean, readable text content from a JTextPane. Here is some example code that uses a JTextPane:

JTextPane textPane = new JTextPane ();
textPane.setContentType ("text/html");
textPane.setText ("This <b>is</b> a <b>test</b>.");
String text = textPane.getText ();
System.out.println (text);

The text looks like this in the JTextPane:

This is a test.

This is what gets printed to the console:

<html>
  <head>

  </head>
  <body>
    This <b>is</b> a <b>test</b>.
  </body>
</html>

I've tried substring() and/or replace(), but that is awkward to use:

String text = textPane.getText ().replace ("<html> ... <body>\n    ", "");

Is there a simple function that removes all tags other than <b> tags (while keeping their content) from the string?

Sometimes JTextPane adds <p> tags around the content, so I want to get rid of those as well.

Like this:

<html>
  <head>

  </head>
  <body>
    <p style="margin-top: 0">
      hdfhdfgh
    </p>
  </body>
</html>

I want to get only the text content, with its <b> tags:

This <b>is</b> a <b>test</b>.

Solution

  • I subclassed HTMLWriter and overrode startTag and endTag to skip all tags outside of <body>.

    I did not test it much, but it seems to work. One drawback is that the output string contains quite a lot of whitespace; getting rid of that shouldn't be too hard.

    import java.io.*;
    import javax.swing.*;
    import javax.swing.text.*;
    import javax.swing.text.html.*;
    
    public class Foo {
    
        public static void main(String[] args) throws Exception {
            JTextPane textPane = new JTextPane();
            textPane.setContentType("text/html");
            textPane.setText("<p>This</p> <b>is</b> a <b>test</b>.");
    
            StringWriter writer = new StringWriter();
            HTMLDocument doc = (HTMLDocument) textPane.getStyledDocument();
    
            HTMLWriter htmlWriter = new OnlyBodyHTMLWriter(writer, doc);
            htmlWriter.write();
    
            System.out.println(writer.toString());
        }
    
        private static class OnlyBodyHTMLWriter extends HTMLWriter {
    
            public OnlyBodyHTMLWriter(Writer w, HTMLDocument doc) {
                super(w, doc);
            }
    
            // True while the writer is between <body> and </body>.
            private boolean inBody = false;
    
            private boolean isBody(Element elem) {
                // copied from HTMLWriter.startTag()
                AttributeSet attr = elem.getAttributes();
                Object nameAttribute = attr
                        .getAttribute(StyleConstants.NameAttribute);
                HTML.Tag name = null;
                if (nameAttribute instanceof HTML.Tag) {
                    name = (HTML.Tag) nameAttribute;
                }
                return name == HTML.Tag.BODY;
            }
    
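            @Override
            protected void startTag(Element elem) throws IOException,
                    BadLocationException {
                // Write the tag only if we are already inside <body>;
                // checking before setting the flag skips <body> itself.
                if (inBody) {
                    super.startTag(elem);
                }
                if (isBody(elem)) {
                    inBody = true;
                }
            }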
    
            @Override
            protected void endTag(Element elem) throws IOException {
                // Clear the flag first so that </body> itself is skipped.
                if (isBody(elem)) {
                    inBody = false;
                }
                if (inBody) {
                    super.endTag(elem);
                }
            }
        }
    }
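
The writer above still emits the tags that sit inside <body> (for example the <p> wrappers mentioned in the question) together with its indentation. As a rough sketch of the whitespace cleanup, the string could be post-processed with two regular expressions. The class name HtmlCleanup and its methods are made up for this illustration and are not part of Swing. Regex-based tag removal is fragile on arbitrary HTML, and collapsing whitespace would also alter text inside <pre> blocks, so this only suits simple content like the examples in the question.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Swing.
final class HtmlCleanup {

    // Matches <p>, <p ...attributes...> and </p>, but not <pre> or <param>.
    private static final Pattern P_TAG =
            Pattern.compile("</?p\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Matches any run of spaces, tabs and line breaks.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Removes paragraph tags but keeps the text between them.
    static String stripParagraphTags(String html) {
        return P_TAG.matcher(html).replaceAll("");
    }

    // Collapses each whitespace run to one space and trims both ends.
    static String collapseWhitespace(String html) {
        return WHITESPACE.matcher(html).replaceAll(" ").trim();
    }

    // Applies both steps in order.
    static String clean(String html) {
        return collapseWhitespace(stripParagraphTags(html));
    }

    public static void main(String[] args) {
        // Assumed input: body content shaped like the output in the question.
        String raw = "\n    <p style=\"margin-top: 0\">\n      hdfhdfgh\n    </p>\n  ";
        System.out.println(clean(raw)); // prints: hdfhdfgh
    }
}
```

With the writer from the answer, this would be called as HtmlCleanup.clean(writer.toString()). Tags other than <p>, such as <b>, pass through unchanged.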