I'd like to create a regex that will match an opening <a> tag containing an href attribute only:
<a href="doesntmatter.com">
It should match the above, but not match when other attributes are added:
<a href="doesntmatter.com" onmouseover="alert('Do something evil with Javascript')">
Normally that would be pretty easy, but the HTML is encoded. So encoding both of the above, I need the regex to match this:
&lt;a href=&quot;doesntmatter.com&quot; &gt;
But not match this:
&lt;a href=&quot;doesntmatter.com&quot; onmouseover=&quot;alert('do something evil with javascript.')&quot; &gt;
Assume all encoded HTML is "valid" (no weird malformed XSS trickery) and assume that we don't need to follow any HTML sanitization best practices. I just need the simplest regex that will match A) above but not B).
Thanks!
I don't see how matching one is different from the other? You're just looking for exactly what you just wrote, making the portion that is doesntmatter.com the part you capture. I guess matching for anything until &quot; (not "?) can present a problem, but you do it like this in regex:
(?:(?!&quot;).)*
It essentially means: match any character, one at a time, for as long as &quot; does not start at the current position.
The complete regular expression would be:
/&lt;a href=&quot;(?>(?:[^&]+|(?!&quot;).)*)&quot;&gt;/s
This is more efficient than using a non-greedy expression.
Credit to Daniel Vandersluis for reminding me of the atomic group! It fits nicely here for the sake of optimization (this pattern can never match if it has to backtrack.)
I also threw in an additional [^&]+ group to avoid repeating the negative look-ahead so many times.
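To see the "anything until &quot;" piece on its own, here is a minimal Python sketch. The names UNTIL_QUOT and href_value are made up for illustration, and re.DOTALL stands in for the /s flag. It uses only the plain look-ahead form, without the [^&]+ shortcut, because without an atomic group that shortcut would reintroduce backtracking.

```python
import re

# Tempered dot: take one character at a time, but only while the entity
# &quot; does not start at the current position. re.DOTALL mirrors /s.
UNTIL_QUOT = re.compile(r'(?:(?!&quot;).)*', re.DOTALL)


def href_value(text):
    """Return everything in text up to (not including) the first &quot;."""
    return UNTIL_QUOT.match(text).group(0)


# Stops at the first &quot;, so the extra attribute is never consumed.
print(href_value('doesntmatter.com&quot; onmouseover=&quot;alert(1)&quot;'))
# prints: doesntmatter.com

# Other entities such as &amp; are allowed through; only &quot; stops it.
print(href_value('a.com/?x=1&amp;y=2&quot;&gt;'))
# prints: a.com/?x=1&amp;y=2
```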
Alternatively, one could use a possessive quantifier, which essentially does the same thing (your regex engine might not support it):
/&lt;a href=&quot;(?:[^&]+|(?!&quot;).)*+&quot;&gt;/s
As you can see it's slightly shorter.
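If you want to try the variants against the two encoded strings from the question, something like the following Python sketch should do. Two things here are my assumptions rather than part of the patterns above: I added \s* before &gt; because the encoded samples in the question have a space there, and I only enable the atomic and possessive forms on Python 3.11 or later, since older versions of the re module reject that syntax. The names ENCODED_OK, ENCODED_BAD, PATTERNS and results are just for illustration.

```python
import re
import sys

# The two encoded samples from the question (note the space before &gt;).
ENCODED_OK = '&lt;a href=&quot;doesntmatter.com&quot; &gt;'
ENCODED_BAD = ('&lt;a href=&quot;doesntmatter.com&quot; '
               'onmouseover=&quot;alert(\'do something evil with javascript.\')&quot; &gt;')

# Portable form: plain look-ahead loop, no atomic group and no [^&]+ branch.
PATTERNS = {
    'tempered': r'&lt;a href=&quot;(?:(?!&quot;).)*&quot;\s*&gt;',
}

# Atomic groups and possessive quantifiers need Python 3.11+ in the re module.
if sys.version_info >= (3, 11):
    PATTERNS['atomic'] = r'&lt;a href=&quot;(?>(?:[^&]+|(?!&quot;).)*)&quot;\s*&gt;'
    PATTERNS['possessive'] = r'&lt;a href=&quot;(?:[^&]+|(?!&quot;).)*+&quot;\s*&gt;'

# For each variant: (matches the href-only tag, matches the tag with onmouseover)
results = {
    name: (
        bool(re.fullmatch(pattern, ENCODED_OK, re.DOTALL)),
        bool(re.fullmatch(pattern, ENCODED_BAD, re.DOTALL)),
    )
    for name, pattern in PATTERNS.items()
}

for name, (ok, bad) in results.items():
    print(name, ok, bad)
```

Every variant that runs should report True for the href-only tag and False for the one with onmouseover, because none of them can step over the &quot; that closes the href value.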