When copying and pasting content from a Word document into a Vaadin 7 RichTextArea (or any other rich text field), you get plenty of unwanted HTML tags and attributes. Since the width attribute causes trouble in my current project, I'd like to remove it with the following function:
private String cleanUpHTMLcontent(String content) {
    LOG.log(Level.INFO, "Cleaning up that rubbish now");
    content = content.replaceAll("width=\"[0-9]*\"", ""); // this works fine
    content = content.replaceAll("width:[0-9]*[\\.|]*[0-9]*pt;", ""); // not working
    content = content.replaceAll(";width:[0-9]*[\\.|]*[0-9]*pt", ""); // not working
    content = content.replaceAll("width:[0-9]*[\\.|]*[0-9]*pt", ""); // not working
    return content;
}
The first line works fine and removes old-style HTML attributes like width="500". The other lines target the style attribute and try to remove properties like width:300.45pt; with different positions of the semicolon.
The code works fine on the test page http://www.regexplanet.com/advanced/java/index.html . I generated my regex strings there, specifically for Java, but it's still not working. Does anyone have an idea?
Here is an example where it doesn't find the width property:
td style="width:453.1pt;border:solid windowtext 1.0pt;
UPDATE
content = content.replaceAll("width:\\s*[.0-9]*pt;",""); // doesn't work
content = content.replaceAll(";width:\\s*[.0-9]*pt",""); // doesn't work
content = content.replaceAll("width:\\s*[.0-9]*pt",""); // works :-)
It appears that I have to escape the semicolon with a backslash as well? I will test that.
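On the escaping question: as far as I know, `;` is not a metacharacter in java.util.regex, so escaping it should make no difference. Here is a quick check against the sample string from the question; the class name `SemicolonCheck` and the helper `finds` are made up for this sketch.

```java
import java.util.regex.Pattern;

// Hypothetical test class, not part of the original code.
class SemicolonCheck {
    // Returns true if the regex matches anywhere in the input.
    static boolean finds(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Sample taken from the question.
        String sample = "td style=\"width:453.1pt;border:solid windowtext 1.0pt;";

        // Unescaped semicolon.
        System.out.println(finds("width:\\s*[.0-9]*pt;", sample));
        // Escaped semicolon: a backslash before a non-alphabetic character is allowed.
        System.out.println(finds("width:\\s*[.0-9]*pt\\;", sample));
    }
}
```

If both variants match the sample but the replacement still fails on the real editor content, the input probably differs from the sample in some way, for instance a missing trailing semicolon or extra whitespace. That is a guess, not something the question confirms.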
To remove any number of digits and dots you can use a character class, [.\d]* or [.0-9]*:
"\\bwidth:\\s*[.0-9]*pt;"
The \b is a word boundary (makes sure we only match width as a whole word).
Details:
\b - leading word boundary
width: - literal string width:
\s* - 0+ whitespace symbols
[.0-9]* - 0+ dots or digits
pt; - literal pt;
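A minimal sketch of the pattern above applied to the sample from the question. The class name `WidthCleaner` and the method `stripPtWidth` are hypothetical, and making the trailing semicolon optional (`;?`) is my own assumption so that one pattern also covers a width declaration that ends the style attribute without a semicolon.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original code.
class WidthCleaner {
    // Pattern from the answer, with the trailing semicolon made optional (assumption).
    private static final Pattern PT_WIDTH =
            Pattern.compile("\\bwidth:\\s*[.0-9]*pt;?");

    // Removes every "width:<number>pt" declaration, with or without a trailing semicolon.
    static String stripPtWidth(String html) {
        return PT_WIDTH.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String sample = "<td style=\"width:453.1pt;border:solid windowtext 1.0pt;\">";
        // prints <td style="border:solid windowtext 1.0pt;">
        System.out.println(stripPtWidth(sample));
    }
}
```

One caveat: a hyphen is a non-word character, so \b also matches right before width in hyphenated properties such as border-width or max-width. If those must be kept, the pattern would need an additional check.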