I'm trying to extract information from HTML files into a database, and I just found that a link can look like this:
<a href="/blabla/12345678" class="someclass">channel crosstalk: <60dB</a>
Therefore my regular expression doesn't find that link:
preg_match_all('|<a href="/blabla/([0-9]+)"[^>]*>([^<]*)</a>|Uis',$html,$matches);
This is part of a bigger regular expression; I simplified it for the example.
This is the fundamental issue with trying to regex HTML. This is not really good HTML, because content that is not meant to be interpreted as HTML should be escaped as HTML entities (i.e. &lt; instead of <). You won't always be able to handle that, though.
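In your case, something like this works for regex:
|<a href="/blabla/([0-9]+)"[^>]*>(.*)</a>|Uis
The second group now captures everything up to the first </a>, so a stray < in the link text no longer breaks the match. This also allows nested tags (like <a><b><i></i></b></a>). Note that under the U modifier a plain .* is already lazy; writing .*? there would flip it back to greedy.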
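To check this against the sample link from the question, here is a minimal sketch. It assumes a variant of the pattern that keeps [^>]* to skip extra attributes and wraps a plain .* (lazy under U) in a capture group; the variable names are only illustrative.

```php
<?php
// Sample link from the question; the raw "<" in the text is what breaks [^<]*.
$html = '<a href="/blabla/12345678" class="someclass">channel crosstalk: <60dB</a>';

// With the U modifier a plain .* is already lazy, so it stops at the first </a>.
// [^>]* skips any extra attributes such as class="someclass".
$pattern = '|<a href="/blabla/([0-9]+)"[^>]*>(.*)</a>|Uis';

preg_match_all($pattern, $html, $matches);

echo $matches[1][0], "\n"; // the numeric ID from the href
echo $matches[2][0], "\n"; // the link text, including the stray "<"
```

Here $matches[1] holds the IDs and $matches[2] the link texts, the same layout as in the original pattern.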
Keep in mind that the U (ungreedy) modifier you used lets you be a little more lax in your regex matching. If you wanted to do this without the U modifier, you'd probably need a negative lookahead:
|<a href="/blabla/([0-9]+)"[^>]*>((?:(?!</a>).)*)</a>|is
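As a rough sketch of the lookahead approach without the U modifier, the snippet below runs it over two links on one line. The sample HTML is made up for illustration, and the pattern shown is an assumed variant that adds [^>]* for extra attributes and a capture group around the link text.

```php
<?php
// Hypothetical input: two links on one line, one with nested tags, one with a raw "<".
$html = '<a href="/blabla/1">first <b>bold</b></a> <a href="/blabla/2" class="x">a <60dB</a>';

// (?:(?!</a>).)* consumes one character at a time, but only while "</a>" does not
// start at the current position, so each match ends at its own closing tag.
$pattern = '|<a href="/blabla/([0-9]+)"[^>]*>((?:(?!</a>).)*)</a>|is';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] is the ID, $m[2] is the link text
    echo $m[1], ' => ', $m[2], "\n";
}
```

PREG_SET_ORDER is used here only so that each link's ID and text come back together in one array entry.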