Getting all attributes from an <a> HTML tag with regex

I already have a function that retrieves the href attribute from all of the a tags on a given page of markup. However, I would also like to retrieve other attributes, namely the title attribute.

I have a feeling it's a simple modification of the regular expression that I'm already using, but my only concern is the order of appearance in the markup. If I have a link with this code:

<a href="somepage.html" title="My Page">link text</a>

I want it to be parsed the same and not cause any errors even if it appears like this:

<a title="My Page" href="somepage.html">link text</a>

Here is my processing function:

function getLinks($src) {
    if(preg_match_all('/<a\s+href=["\']([^"\']+)["\']/i', $src, $links, PREG_PATTERN_ORDER))
        return array_unique($links[1]);
    return false;
}

Would I have to use another regex altogether, or would it be possible to modify this one so that the title attribute is stored in the same array of returned data as the href attribute?

Solution

You can build on that regex. Have a look:

'/<a(?:\s+(?:href=["\'](?P<href>[^"\'<>]+)["\']|title=["\'](?P<title>[^"\'<>]+)["\']|\w+=["\'][^"\'<>]+["\']))+/i'

...or in human-readable form:

preg_match_all(
    '/<a
    (?:\s+
      (?:
         href=["\'](?P<href>[^"\'<>]+)["\']
        |
         title=["\'](?P<title>[^"\'<>]+)["\']
        |
         \w+=["\'][^"\'<>]+["\']
      )
    )+/ix', 
    $subject, $result, PREG_PATTERN_ORDER);
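
As a rough sketch of how that pattern could be wrapped in a function, the helper below returns one href/title pair per tag. The name getLinkAttributes and the returned array shape are my own assumptions, not something from your code. It uses PREG_SET_ORDER so each tag's attributes stay together, and isset() because trailing groups that did not participate are left out of a set-order match:

```php
<?php
// Hypothetical helper (name and return shape are assumptions for illustration):
// returns one array per matched <a> tag with 'href' and 'title' keys,
// using an empty string when the attribute is absent.
function getLinkAttributes($src) {
    $pattern = '/<a
        (?:\s+
          (?:
             href=["\'](?P<href>[^"\'<>]+)["\']
            |
             title=["\'](?P<title>[^"\'<>]+)["\']
            |
             \w+=["\'][^"\'<>]+["\']
          )
        )+/ix';

    $links = array();
    if (preg_match_all($pattern, $src, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            // A trailing group that never matched is omitted in set order,
            // so check with isset() before reading it.
            $links[] = array(
                'href'  => isset($m['href'])  ? $m['href']  : '',
                'title' => isset($m['title']) ? $m['title'] : '',
            );
        }
    }
    return $links;
}
```

One caveat: the value class [^"\'<>]+ excludes both quote characters, so a value such as title="John's Page" would be cut short at the apostrophe.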

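Pretty self-explanatory, I think: each pass through the outer group consumes one attribute, so href and title are captured by name in whichever order they appear, and any other attribute is matched and discarded by the third branch. The values end up in $result['href'] and $result['title']. Note that your original regex already has an order-of-appearance problem of its own, because it only matches when href is the first attribute. For example, it would fail to match this tag: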

<a class="someclass" href="somepage.html">link text</a>

Unless you're absolutely sure there will be no other attributes, you can't reasonably expect href to be listed first. You can use the same gimmick as above, with a final branch that silently consumes and discards the attributes that don't interest you:

preg_match_all(
    '/<a
    (?:\s+
      (?:
         href=["\'](?P<href>[^"\'<>]+)["\']
        |
         \w+=["\'][^"\'<>]+["\']
      )
    )+/ix',
    $subject, $result, PREG_PATTERN_ORDER);
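
For completeness, here is one way the href-only pattern might be dropped into your existing function. This is a sketch that keeps your name and return convention; the filtering step is my own addition, needed because a tag with no href (for example <a name="top">) still matches and leaves an empty string in the capture:

```php
<?php
// Sketch only: same name and return convention as the function in the
// question, with the order-independent href pattern swapped in.
function getLinks($src) {
    $pattern = '/<a
        (?:\s+
          (?:
             href=["\'](?P<href>[^"\'<>]+)["\']
            |
             \w+=["\'][^"\'<>]+["\']
          )
        )+/ix';

    if (preg_match_all($pattern, $src, $links, PREG_PATTERN_ORDER)) {
        // Tags without an href contribute an empty string; drop those,
        // then de-duplicate and reindex the remaining values.
        $hrefs = array_filter($links['href'], 'strlen');
        return array_values(array_unique($hrefs));
    }
    return false;
}
```

Called on the class-first tag shown earlier, this should pick up somepage.html, which the original pattern missed.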