html regex validation sanitization html-encode

Regex for Encoded HTML

I'd like to create a regex that will match an opening <a> tag containing an href attribute only:

<a href="doesntmatter.com">

It should match the above, but not match when other attributes are added:

<a href="doesntmatter.com" onmouseover="alert('Do something evil with Javascript')">

Normally that would be pretty easy, but the HTML is encoded. So encoding both of the above, I need the regex to match this:

&#60;a href&#61;&#34;doesntmatter.com&#34; &#62;

But not match this:

&#60;a href&#61;&#34;doesntmatter.com&#34; onmouseover&#61;&#34;alert&#40;&#39;do something evil with javascript.&#39;&#41;&#34; &#62;

Assume all encoded HTML is "valid" (no weird malformed XSS trickery) and assume that we don't need to follow any HTML sanitization best practices. I just need the simplest regex that will match A) above but not B).

Thanks!

Solution

I don't see how matching the encoded form is any different from matching the plain one: you look for exactly the text you wrote, with doesntmatter.com being the part that varies. Matching anything up to &#34; (rather than a single " character) can be awkward, because a negated character class cannot exclude a multi-character sequence, but you can do it like this in regex:

(?:(?!&#34;).)*

It essentially means:

Match the following group 0 or more times
- Fail match if the following string is "&#34;"
- Match any character (except new line unless DOTALL is specified)
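As a rough illustration of the construct described above, here is a minimal Python sketch. The names VALUE, HREF and href_value are made up for this example, and wrapping the construct in a capturing group after the encoded `<a href="` prefix is an assumption for demonstration, not part of the pattern itself.

```python
import re

# Tempered dot: consume one character at a time, but only while the
# encoded quote "&#34;" does not start at the current position.
VALUE = r'(?:(?!&#34;).)*'

# Hypothetical wrapper: anchor on the encoded `<a href="` prefix and
# capture the attribute value up to the closing encoded quote.
HREF = re.compile(r'&#60;a href&#61;&#34;(' + VALUE + r')&#34;', re.DOTALL)

sample = '&#60;a href&#61;&#34;doesntmatter.com&#34; &#62;'
m = HREF.match(sample)
href_value = m.group(1) if m else None
print(href_value)  # doesntmatter.com
```

The look-ahead is what stops the repetition at the first &#34;; no backtracking is needed to find the end of the value.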

The complete regular expression (with \s* added so that the optional space before the closing &#62; in your sample is accepted) would be:

/&#60;a href&#61;&#34;(?>(?:[^&]+|(?!&#34;).)*)&#34;\s*&#62;/s

This is more efficient than using a non-greedy expression.

Credit to Daniel Vandersluis for reminding me of the atomic group! It fits nicely here as an optimization: once the group has consumed the attribute value, backtracking into it can never produce a match, so the engine may as well fail immediately.

I also threw in an additional [^&]+ alternative so that runs of characters other than & are consumed in one step, and the negative look-ahead only has to run at each &.
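To try the idea against the two samples from the question, a sketch along these lines may help. It is written for Python, whose re module only accepts atomic groups from version 3.11 on, so it falls back to the plain look-ahead form on older interpreters. The helper name is_href_only_anchor is hypothetical, and the optional whitespace before the closing &#62; is an assumption made here because the encoded sample in the question has a space there.

```python
import re
import sys

PREFIX = r'&#60;a href&#61;&#34;'
# Assumption: allow optional whitespace before the encoded ">".
SUFFIX = r'&#34;\s*&#62;'

if sys.version_info >= (3, 11):
    # Atomic group: once the value is consumed, never backtrack into it.
    VALUE = r'(?>(?:[^&]+|(?!&#34;).)*)'
else:
    # Older Python has no atomic groups; use the plain tempered dot.
    VALUE = r'(?:(?!&#34;).)*'

ANCHOR_RE = re.compile(PREFIX + VALUE + SUFFIX, re.DOTALL)

def is_href_only_anchor(encoded_tag):
    """Return True if the encoded tag is an <a> with only an href attribute."""
    return ANCHOR_RE.fullmatch(encoded_tag) is not None

good = '&#60;a href&#61;&#34;doesntmatter.com&#34; &#62;'
bad = ('&#60;a href&#61;&#34;doesntmatter.com&#34; '
       'onmouseover&#61;&#34;alert&#40;&#39;do something evil '
       'with javascript.&#39;&#41;&#34; &#62;')

print(is_href_only_anchor(good))  # True
print(is_href_only_anchor(bad))   # False
```

The fallback deliberately leaves out the [^&]+ alternative: nesting that quantifier inside the outer repetition without an atomic group can backtrack very heavily on input that does not match.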

Alternatively, one could use a possessive quantifier, which essentially does the same thing (your regex engine might not support it):

/&#60;a href&#61;&#34;(?:[^&]+|(?!&#34;).)*+&#34;\s*&#62;/s

As you can see it's slightly shorter.