preg_match link text with less-than sign in it

I'm trying to extract information from HTML files into a database, and I found that a link can look like this:

<a href="/blabla/12345678" class="someclass">channel crosstalk: <60dB</a>

Therefore my regular expression doesn't find that link:

preg_match_all('|<a href="/blabla/([0-9]+)"[^>]*>([^<]*)</a>|Uis',$html,$matches);

This is part of a larger regular expression; I simplified it for the example.

Solution

This is the fundamental issue with trying to parse HTML with a regex. This is not really valid HTML, because content that is not meant to be interpreted as markup should be written as HTML entities (e.g. &lt; instead of <). You won't always be able to control that, though.
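To illustrate the point about entities, here is a minimal sketch, assuming you control (or can normalize) the markup before matching. The variables $text, $encoded and $html are made up for the example; htmlspecialchars() and html_entity_decode() are standard PHP functions. Once the text is properly encoded, the original pattern from the question matches:

```php
<?php
// Hypothetical link text containing a literal less-than sign.
$text = 'channel crosstalk: <60dB';

// htmlspecialchars() turns "<" into "&lt;", which is what valid HTML should contain.
$encoded = htmlspecialchars($text);

$html = '<a href="/blabla/12345678" class="someclass">' . $encoded . '</a>';

// The pattern from the question works here, because the text has no raw "<".
preg_match_all('|<a href="/blabla/([0-9]+)"[^>]*>([^<]*)</a>|Uis', $html, $matches);

$id    = $matches[1][0];
// Decode the entity again to get the human-readable text back.
$label = html_entity_decode($matches[2][0]);
```

This only helps when the HTML is generated or cleaned on your side; for third-party pages with raw < in the text, the pattern itself has to be more tolerant.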

In your case, something like this works for regex:

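|<a href="/blabla/([0-9]+)"[^>]*>(.*)</a>|Uis

The capture groups stay the same as in your pattern, but the link text is now matched with (.*) instead of ([^<]*), so a stray < no longer breaks the match. Because of the U modifier, .* is ungreedy and stops at the first </a>. This also allows nested tags inside the link (like <a><b><i></i></b></a>).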
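As a rough check of this approach, the sketch below runs a U-modifier pattern that keeps [^>]* for extra attributes and captures the link text with (.*). The sample markup, the second link and the $links array are assumptions added for illustration:

```php
<?php
// Sample markup: the first link is from the question, the second is a made-up nested case.
$html = '<a href="/blabla/12345678" class="someclass">channel crosstalk: <60dB</a>'
      . '<a href="/blabla/42"><b>nested</b> text</a>';

// With the U modifier, (.*) is lazy and stops at the first "</a>".
$pattern = '|<a href="/blabla/([0-9]+)"[^>]*>(.*)</a>|Uis';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$links = [];
foreach ($matches as $m) {
    // $m[1] is the numeric id, $m[2] is the raw inner HTML of the link.
    $links[] = ['id' => $m[1], 'text' => $m[2]];
}
```

Note that the captured text is raw inner HTML, so nested tags such as <b> are kept; strip_tags() could be applied afterwards if only plain text is wanted.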

Keep in mind that the Ungreedy (U) modifier you used inverts the greediness of quantifiers: they become lazy by default, and a trailing ? makes them greedy again. If you wanted to do this without the U modifier, you could use a lazy quantifier (.*?) or a negative lookahead:

|<a href="/blabla/([0-9]+)"[^>]*>((?:(?!</a>).)*)</a>|is
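For comparison, here is a small sketch without the U modifier. It assumes the same kind of input as in the question (the second link is invented for the test) and puts the lookahead variant next to a plain lazy quantifier; both patterns keep [^>]* for extra attributes and capture the link text:

```php
<?php
$html = '<a href="/blabla/12345678" class="someclass">channel crosstalk: <60dB</a>'
      . '<a href="/blabla/42">second</a>';

// No U modifier: the negative lookahead keeps each "." from stepping over "</a>".
$tempered = '|<a href="/blabla/([0-9]+)"[^>]*>((?:(?!</a>).)*)</a>|is';

// Alternative: an explicitly lazy quantifier.
$lazy = '|<a href="/blabla/([0-9]+)"[^>]*>(.*?)</a>|is';

preg_match_all($tempered, $html, $a);
preg_match_all($lazy, $html, $b);

// $a[1] and $b[1] hold the ids, $a[2] and $b[2] hold the link texts.
```

On this input the two patterns give the same result, so the lazy form is usually the simpler choice; the lookahead form is mainly useful when the quantifier must stay greedy for other reasons.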