I am using this regex
private static final String HREF_PATTERN =
"\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))";
to get the link from
<a href=www.example.com/1234 5678>
The URL is malformed: it contains whitespace. The problem is that I want to get the whole link including "5678", but I only get "www.example.com/1234".
I am not that good with regular expressions. Can someone please provide a regex that captures the whole URL "www.example.com/1234 5678"?
Thanks
The external program creates an HTML email with several
<a href=www.example.com/1234 5678>
tags.
Assuming you cannot fix it at the source, you can try to work around it with a regex.
If the href attribute is the only attribute in the tag, you do not have to treat whitespace as the end of an unquoted value. Remove \\s from the last character class of your pattern ([^'\">\\s] becomes [^'\">]) and it will work.
private static final String HREF_PATTERN =
"(?i)\\s*href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">]+))";
// The last character class no longer contains \\s, so an unquoted value may include spaces.
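To check the single-attribute case, here is a minimal sketch that runs the pattern above against the tag from the question. The class name HrefDemo and the helper extractHref are made up for illustration; they are not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefDemo {
    // Pattern from the answer: the unquoted branch no longer excludes whitespace.
    private static final String HREF_PATTERN =
        "(?i)\\s*href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">]+))";

    // Hypothetical helper: returns the raw href value (group 1), or null if no href is found.
    static String extractHref(String tag) {
        Matcher m = Pattern.compile(HREF_PATTERN).matcher(tag);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Prints: www.example.com/1234 5678
        System.out.println(extractHref("<a href=www.example.com/1234 5678>"));
    }
}
```

Note that group 1 holds the whole value; for a quoted href it still includes the surrounding quotes.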
If the tag has other attributes with values after href, you will have to use a look-ahead to stop the match before the next attribute:
private static final String HREF_PATTERN =
"(?i)\\s*href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">]+(?=>|\\s+\\w+=)))";
However, this will not work with valueless attributes like nofollow.
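A small sketch of how the look-ahead version behaves, including the limitation just mentioned. The class name HrefLookaheadDemo, the helper extractHref, and the sample tags are assumptions for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefLookaheadDemo {
    // Look-ahead pattern from the answer: the unquoted value must be followed
    // by '>' or by whitespace plus something that looks like name=.
    private static final String HREF_PATTERN =
        "(?i)\\s*href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">]+(?=>|\\s+\\w+=)))";

    // Hypothetical helper: returns the raw href value (group 1), or null if no href is found.
    static String extractHref(String tag) {
        Matcher m = Pattern.compile(HREF_PATTERN).matcher(tag);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // A following attribute with a quoted value is left out of the match.
        // Prints: www.example.com/1234 5678
        System.out.println(extractHref("<a href=www.example.com/1234 5678 class=\"link\">"));

        // A valueless attribute has no '=', so it is swallowed into the URL.
        // Prints: www.example.com/1234 5678 nofollow
        System.out.println(extractHref("<a href=www.example.com/1234 5678 nofollow>"));
    }
}
```

Because + is greedy, a following attribute with an unquoted value (for example target=_blank) would also end up in the match. Switching to the reluctant +? is one possible adjustment; that is my suggestion, not part of the pattern above.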