I'm trying to get clean text content out of a JTextPane. Here is some example code:
JTextPane textPane = new JTextPane ();
textPane.setContentType ("text/html");
textPane.setText ("This <b>is</b> a <b>test</b>.");
String text = textPane.getText ();
System.out.println (text);
The text looks like this in the JTextPane:
This is a test.
I get this kind of print to console:
<html>
<head>
</head>
<body>
This <b>is</b> a <b>test</b>.
</body>
</html>
I've tried substring() and/or replace(), but they are awkward to use:
String text = textPane.getText ().replace ("<html> ... <body>\n", "");
Is there a simple way to strip every tag except <b> from the string while keeping the content?
Sometimes JTextPane also wraps the content in <p> tags, so I want to get rid of those as well. For example:
<html>
<head>
</head>
<body>
<p style="margin-top: 0">
hdfhdfgh
</p>
</body>
</html>
I want to end up with only the text content and its <b> tags:
This <b>is</b> a <b>test</b>.
I subclassed HTMLWriter and overrode startTag and endTag to skip all tags outside of <body>.
I haven't tested it much, but it seems to work. One drawback is that the output string contains quite a lot of whitespace. Getting rid of that shouldn't be too hard.
import java.io.*;
import javax.swing.*;
import javax.swing.text.*;
import javax.swing.text.html.*;

public class Foo {

    public static void main(String[] args) throws Exception {
        JTextPane textPane = new JTextPane();
        textPane.setContentType("text/html");
        textPane.setText("<p>This</p> <b>is</b> a <b>test</b>.");

        // Write the document through the custom writer instead of calling getText()
        StringWriter writer = new StringWriter();
        HTMLDocument doc = (HTMLDocument) textPane.getStyledDocument();
        HTMLWriter htmlWriter = new OnlyBodyHTMLWriter(writer, doc);
        htmlWriter.write();
        System.out.println(writer.toString());
    }

    private static class OnlyBodyHTMLWriter extends HTMLWriter {

        // True while the writer is between <body> and </body>
        private boolean inBody = false;

        public OnlyBodyHTMLWriter(Writer w, HTMLDocument doc) {
            super(w, doc);
        }

        private boolean isBody(Element elem) {
            // copied from HTMLWriter.startTag()
            AttributeSet attr = elem.getAttributes();
            Object nameAttribute = attr
                    .getAttribute(StyleConstants.NameAttribute);
            HTML.Tag name = null;
            if (nameAttribute instanceof HTML.Tag) {
                name = (HTML.Tag) nameAttribute;
            }
            return name == HTML.Tag.BODY;
        }

        @Override
        protected void startTag(Element elem) throws IOException,
                BadLocationException {
            // Check before updating the flag so that <body> itself is not written
            if (inBody) {
                super.startTag(elem);
            }
            if (isBody(elem)) {
                inBody = true;
            }
        }

        @Override
        protected void endTag(Element elem) throws IOException {
            // Clear the flag first so that </body> itself is not written
            if (isBody(elem)) {
                inBody = false;
            }
            if (inBody) {
                super.endTag(elem);
            }
        }
    }
}
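For the leftover whitespace, and for the <p> tags that the writer above still emits inside <body>, a small regex post-processing step might be enough. This is only a rough sketch and not part of the code above: the class name BodyHtmlCleaner and the method clean are made up for illustration, and it assumes markup as simple as what JTextPane produces here (no '>' inside attribute values, no comments or scripts). It drops every tag except <b> and </b>, then collapses runs of whitespace into single spaces.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Swing: post-processes the writer's output.
final class BodyHtmlCleaner {

    // Any tag that is not exactly <b> or </b> (case-insensitive).
    // Assumption: attribute values never contain '>'.
    private static final Pattern NON_BOLD_TAG =
            Pattern.compile("(?i)<(?!/?b\\s*>)[^>]*>");

    // One or more whitespace characters, including line breaks.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private BodyHtmlCleaner() {
    }

    public static String clean(String html) {
        // Remove every tag other than <b>/</b>; the text between tags is kept.
        String withoutTags = NON_BOLD_TAG.matcher(html).replaceAll("");
        // Collapse indentation and line breaks, then trim both ends.
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String raw = "\n    <p style=\"margin-top: 0\">\n"
                + "      This <b>is</b> a <b>test</b>.\n"
                + "    </p>\n";
        System.out.println(clean(raw)); // This <b>is</b> a <b>test</b>.
    }
}
```

You would call it as BodyHtmlCleaner.clean(writer.toString()), or directly on textPane.getText(). Note that it leaves HTML entities such as &amp; untouched, and that removing tags with a regex is fragile for anything beyond simple markup like this.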