Unable to parse string with a dot in Java with REGEX


When copying and pasting content from a Word document into a Vaadin 7 RichTextArea (or any other rich text field), the result contains plenty of unwanted HTML tags and attributes. In my current project the width attribute breaks the layout, so I'd like to remove it with the following function:

private String cleanUpHTMLcontent(String content) {
    LOG.log(Level.INFO, "Cleaning up that rubbish now");

    content = content.replaceAll("width=\"[0-9]*\"",""); // this works fine
    content = content.replaceAll("width:[0-9]*[\\.|]*[0-9]*pt;",""); // not working
    content = content.replaceAll(";width:[0-9]*[\\.|]*[0-9]*pt",""); // not working
    content = content.replaceAll("width:[0-9]*[\\.|]*[0-9]*pt",""); // not working
    return content; 
}

The first line works fine and removes old-style HTML attributes like width="500". The other lines target the style attribute and try to remove properties like width:300.45pt; with the semicolon in different positions (after, before, or absent).

The patterns work fine on the test page http://www.regexplanet.com/advanced/java/index.html . I generated my regex strings there, specifically for Java, but they still don't work in my application. Any ideas?

Here is an example where it doesn't find the width property:

 td style="width:453.1pt;border:solid windowtext 1.0pt; 
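For what it's worth, the second pattern from the function does match this exact sample when run in isolation, which fits the result from the regex tester. A minimal sketch to reproduce that (SampleCheck and stripWidth are made-up names; only the pattern and the sample string come from the question):

```java
// Hypothetical check class: only the pattern and the first sample string are from the question.
class SampleCheck {

    // Same pattern as the second replaceAll call in cleanUpHTMLcontent.
    static String stripWidth(String content) {
        return content.replaceAll("width:[0-9]*[\\.|]*[0-9]*pt;", "");
    }

    public static void main(String[] args) {
        // Sample from the question: the width declaration is removed.
        System.out.println(stripWidth("td style=\"width:453.1pt;border:solid windowtext 1.0pt;"));

        // Invented variant with a space after the colon: the pattern no longer
        // matches, because nothing in it allows whitespace.
        System.out.println(stripWidth("td style=\"width: 453.1pt;border:none;"));
    }
}
```

If the first call prints the cleaned string on your machine as well, the content that actually reaches cleanUpHTMLcontent probably differs from the sample, for example by whitespace after the colon. That is a guess and not something the question confirms, so logging the raw content before the replaceAll calls would be the next thing to check.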

UPDATE

    content = content.replaceAll("width:\\s*[.0-9]*pt;",""); // doesn't work
    content = content.replaceAll(";width:\\s*[.0-9]*pt",""); // doesn't work
    content = content.replaceAll("width:\\s*[.0-9]*pt",""); // works :-)

It appears that I have to escape the semicolon with a backslash as well? I will test that.
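On the escaping question: the semicolon has no special meaning in java.util.regex, so a backslash in front of it is allowed but should not change what the pattern matches. A small sketch comparing both spellings (SemicolonCheck, plain and escaped are made-up names, and the input string is invented for the test):

```java
// Hypothetical check class comparing an unescaped and an escaped semicolon.
class SemicolonCheck {

    // Pattern from the update, semicolon written as-is.
    static String plain(String s) {
        return s.replaceAll("width:\\s*[.0-9]*pt;", "");
    }

    // Same pattern with the semicolon escaped; Java accepts a backslash
    // before any non-alphabetic character.
    static String escaped(String s) {
        return s.replaceAll("width:\\s*[.0-9]*pt\\;", "");
    }

    public static void main(String[] args) {
        String input = "width:453.1pt;border:none"; // invented test input
        System.out.println(plain(input));
        System.out.println(escaped(input));
    }
}
```

Both calls give the same result, so if the variants containing a semicolon still fail on real content, the cause is more likely in the input than in the escaping. That last part is an assumption, since the real content is not shown.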


Solution

  • To match any run of digits and dots you can use a character class, [.\d]* or [.0-9]* (inside a character class the dot is literal and needs no escaping):

    "\\bwidth:\\s*[.0-9]*pt;"

    The \b is a word boundary: it stops width from matching as the tail of a longer word such as linewidth.

    Details:

    • \b - leading word boundary
    • width: - literal string width:
    • \s* - 0+ whitespace symbols
    • [.0-9]* - 0+ dots or digits
    • pt; - literal pt;
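Put together, the pattern above could be used roughly as follows. This is only a sketch under a few assumptions: WidthCleaner and clean are made-up names, the trailing semicolon is made optional so one pattern covers the last declaration in a style attribute too, and \b is swapped for a lookbehind because \b still lets border-width: through (a hyphen is not a word character). The second input is invented for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper; not the asker's class and not part of Vaadin.
class WidthCleaner {

    // (?<![\w-])  not preceded by a word character or hyphen (assumed tweak, replaces \b)
    // width:\s*   literal "width:" followed by optional whitespace
    // [.0-9]*pt   0+ digits or dots, then the unit
    // ;?          optional trailing semicolon (assumed tweak)
    private static final Pattern WIDTH_PT =
            Pattern.compile("(?<![\\w-])width:\\s*[.0-9]*pt;?");

    static String clean(String content) {
        return WIDTH_PT.matcher(content).replaceAll("");
    }

    public static void main(String[] args) {
        // Sample from the question.
        System.out.println(clean("td style=\"width:453.1pt;border:solid windowtext 1.0pt;"));
        // Invented input: border-width is kept, the plain width declaration is removed.
        System.out.println(clean("p style=\"border-width:1.0pt;width: 300.45pt\""));
    }
}
```

Compiling the pattern once avoids recompiling it on every call, which String.replaceAll would do. If widths in other units (px, %) also need to go, the unit part would have to be widened; that is outside what the question asks for.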