java, regex, href

Regex for broken links containing whitespace


I am using this regex

private static final String HREF_PATTERN = 
    "\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))";

to get the link from

 <a href=www.example.com/1234 5678>

The URL is malformed: it contains whitespace. The problem is that I want to get the whole link, including "5678", but I only get "www.example.com/1234".
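
A minimal sketch that reproduces the behavior described above, assuming the pattern is applied with Matcher.find() and the first capturing group is read. The class name HrefWhitespaceRepro and the helper extract are hypothetical and only serve this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefWhitespaceRepro {
    // Pattern from the question. The unquoted branch [^'">\s]+ excludes
    // whitespace, so an unquoted value ends at the first space.
    private static final String HREF_PATTERN =
        "\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))";

    // Returns the raw href value of the first match, or null if none is found.
    static String extract(String html) {
        Matcher m = Pattern.compile(HREF_PATTERN).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Prints only the part before the space.
        System.out.println(extract("<a href=www.example.com/1234 5678>"));
    }
}
```

The match stops after "1234" because the space is not allowed by the last character class, not because of anything in the input after it.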

I am not that good with regular expressions. Can someone please provide a valid regex so that I can get the whole URL "www.example.com/1234 5678"?

Thanks


Solution

  • The external program creates an HTML email with several <a href=www.example.com/1234 5678> tags.

    Assuming you cannot fix it at the source, you can try fixing that with a regex.

    If the href attribute is the only attribute in the tag, the unquoted value does not need to stop at whitespace. Remove \\s from the last character class, [^'\">\\s], and the match will run up to the closing >.

    private static final String HREF_PATTERN = 
       "(?i)\\s*href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">]+))";
                                                         ^
    

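    If the tag has further attributes with values, you will have to use a look-ahead. The quantifier in front of it is made lazy (+?) so that the match stops before the first following name= pair instead of running on to the closing >:

    private static final String HREF_PATTERN = 
        "(?i)\\s*href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">]+?(?=>|\\s+\\w+=)))";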
    

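    However, this will not work when the href is followed by a valueless attribute such as a bare nofollow: the look-ahead only recognizes a following name= pair, so the bare word is captured as part of the URL.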
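
The two variants above can be compared side by side with a small sketch. The class name HrefExtractorSketch, the constants HREF_ONLY and HREF_WITH_ATTRS, and the helper extract are hypothetical names chosen for this illustration. The sketch uses a lazy +? in the unquoted branch of the second pattern so that matching stops at the first position where the look-ahead succeeds; treat that as one possible choice rather than the only correct one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefExtractorSketch {
    // Variant 1: whitespace is allowed in the unquoted value.
    // Suitable only when href is the sole attribute of the tag.
    static final Pattern HREF_ONLY = Pattern.compile(
        "(?i)\\s*href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">]+))");

    // Variant 2: the unquoted value is matched lazily and must be followed
    // by '>' or by whitespace plus a name= pair (the next attribute).
    static final Pattern HREF_WITH_ATTRS = Pattern.compile(
        "(?i)\\s*href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">]+?(?=>|\\s+\\w+=)))");

    // Returns the raw attribute value of the first match (quotes included
    // for quoted values), or null if the pattern does not match.
    static String extract(Pattern p, String html) {
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // href is the only attribute: the whole broken URL is captured.
        System.out.println(extract(HREF_ONLY,
            "<a href=www.example.com/1234 5678>"));
        // A following attribute with a value is left out of the capture.
        System.out.println(extract(HREF_WITH_ATTRS,
            "<a href=www.example.com/1234 5678 class=x>"));
        // Limitation: a valueless attribute ends up inside the capture.
        System.out.println(extract(HREF_WITH_ATTRS,
            "<a href=www.example.com/1234 5678 nofollow>"));
    }
}
```

The last call shows the limitation: the bare word nofollow is indistinguishable from a continuation of the broken URL, so it is returned as part of the value.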