Tags: php, regex, html-parsing, preg-match, text-extraction

Get href of all <a> tags


I am trying to get some preg_match() done.

I have basically come up with this:

preg_match_all('<a href="(.*?)">', $page, $result);

but the output of this is:

Array
(
    [0] => Array
     (
        [0] => a href="/stuff"
        [1] => a href="/stuffstuffstuff"
        
         and much more of this.

I want to remove the href, the slashes, and the quotes, and keep only the content.


Solution

  • First, please do NOT try to parse arbitrary HTML with regex. It will break sooner or later. Regex is not a tool for parsing HTML; it cannot parse it correctly. Three simple examples:

    <a href='stuff'> (different quotes)
    <!-- <a href="stuff">-->
    <a style='something' href="stuff">
    

    These will break your application, and there are endless other inputs that will fail the same way. Not even Chuck Norris can parse HTML with regex correctly; no one can!
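
    If the HTML is not tightly controlled, a DOM parser avoids all three problems. The following is only a minimal sketch using PHP's bundled DOMDocument class (it assumes the dom extension is enabled); the sample $page string is hypothetical and simply reproduces the three cases above:

```php
<?php
// Hypothetical sample input covering the three problem cases listed above.
$page = <<<HTML
<a href='stuff1'>single quotes</a>
<!-- <a href="commented-out">hidden</a> -->
<a style='something' href="stuff2">extra attribute</a>
HTML;

$doc = new DOMDocument();
// Real-world HTML is rarely valid; collect libxml warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($page);
libxml_clear_errors();

// Collect the href attribute of every <a> element that has one.
$hrefs = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    if ($a->hasAttribute('href')) {
        $hrefs[] = $a->getAttribute('href');
    }
}

print_r($hrefs);
```

    The link inside the HTML comment is skipped because the parser treats it as a comment node, not as an element, and the quote style and attribute order no longer matter.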

    But I assume you already know that, and that this is just a small, limited set of known HTML that isn't going to be released publicly, so let's get back to your question:

    preg_match_all expects the regex to be wrapped in delimiter characters, and it matches only what you write between them. If you write

    '<a href="(.*?)">' 
    

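    as a regex, it treats the '<' at the beginning as the opening delimiter (and the '>' at the end as the closing one), so neither is matched. Wrap the pattern in slashes (or another delimiter character that is not alphanumeric, a backslash, or whitespace):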

    preg_match_all('/<a href="(.*?)">/', $page, $result);
    

    Now, it's going to match like:

    [0] => <a href="/stuff">
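
    To see the difference side by side, here is a small sketch that runs both patterns on the same input. The $page string and the variable names $withoutDelimiters and $withDelimiters are made up for illustration:

```php
<?php
// Hypothetical input resembling the page from the question.
$page = '<a href="/stuff">one</a> <a href="/stuffstuffstuff">two</a>';

// '<' and '>' are taken as a bracket-style delimiter pair here,
// so the effective pattern is: a href="(.*?)"
preg_match_all('<a href="(.*?)">', $page, $withoutDelimiters);

// With explicit '/' delimiters the angle brackets are part of the pattern.
preg_match_all('/<a href="(.*?)">/', $page, $withDelimiters);

print_r($withoutDelimiters[0]); // full matches without the angle brackets
print_r($withDelimiters[0]);    // full matches including the angle brackets
```

    The first call reproduces the output from the question (a href="/stuff"); the second matches the whole opening tag.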
    

    But you want only the '/stuff'. $result is an array: $result[0] holds everything the whole regex matched, $result[1] holds what the first ( ) group matched, $result[2] would hold the second ( ) group, and so on. So look in $result[1]; you should find what you want there.
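
    As a rough sketch of reading the capture group, assuming a hypothetical $page like the one in the question. The trim() step at the end is an assumption about what "remove the slashes" was meant to achieve, and $withoutSlashes is a name invented for this example:

```php
<?php
// Hypothetical input resembling the page from the question.
$page = '<a href="/stuff">one</a> <a href="/stuffstuffstuff">two</a>';

preg_match_all('/<a href="(.*?)">/', $page, $result);

// $result[0] holds the full matches, $result[1] the first capture group.
foreach ($result[1] as $href) {
    echo $href, PHP_EOL;
}

// Optional (assumption): strip leading and trailing slashes from each value.
$withoutSlashes = array_map(function ($href) {
    return trim($href, '/');
}, $result[1]);

print_r($withoutSlashes);
```

    Note that trim() only touches slashes at the start and end of each value, so a path such as /a/b would keep its inner slash.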