
regex to pull all attributes out of all meta tags


I'm trying to pull meta tags out of an HTML page so I can compare two pages (live and dev) and check that their SEO is the same after a site redesign/refactor. I need to verify that the title, meta tags (description, Open Graph, etc.), h1s, our analytics (Omniture), and our ad tags (DoubleClick) are all the same.

My problem is that get_meta_tags (http://php.net/manual/en/function.get-meta-tags.php) only picks up tags that have a name= attribute, and the same goes for the solution by "mariano at cricava dot com".

I don't want to restrict it to certain attributes. I could assume that all our meta tags have a name=, property=, or http-equiv= attribute and change the regex accordingly, but I can't be entirely sure: it's a massive website and any random crap could be in the tags (which is exactly what this tool is meant to check), so I'd like to leave it as dynamic as possible.

I have

$page = @file_get_contents('http://.../');
preg_match_all('#<meta(?:\s+?([^\=]+)\=\"(.+?)\")+?\s*?/?>#sui', $page, $matches, PREG_SET_ORDER);

but the repeated subpatterns overwrite each other, so each match only contains the last attribute-name=attribute-value pair:

Array
(
    [0] => Array
        (
            [0] => <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
            [1] => content
            [2] => text/html; charset=UTF-8
        )

    [1] => Array
        (
            [0] => <meta name="description" content="some description" />
            [1] => content
            [2] => some description
        )

    [2] => Array
        (
            [0] => <meta property="og:type" content="website" />
            [1] => content
            [2] => website
        )
...

I need all the attributes for all the meta tags. I could do this in two steps, first pulling the contents of <meta ([^>]*)> and then running a second regular expression on the results, but that seems like it should be unnecessary given the power of regex?


Solution

  • But back to the original question, forget it's HTML for now, is there no way to have recurring subpatterns return in preg_match_all rather than just returning the last match?

    Not possible with preg_*/PCRE: a repeated capturing group only keeps its last match. The same goes for any other regex flavor that I know of, although in Perl you could use a (?{ push @list, $^N }) hack.
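
Since a single pattern cannot return every repetition of a group, the two-step approach mentioned in the question is probably the simplest workaround. The following is only a minimal sketch: extract_meta_attributes is a hypothetical helper name, and it assumes attribute values are double-quoted and contain no ">" character, which may not hold for every page on a large site.

```php
<?php
// Hypothetical helper (not from the original post): collects every
// attribute of every <meta> tag using two regex passes.
// Assumptions: values are double-quoted and never contain ">".
function extract_meta_attributes(string $html): array
{
    $tags = [];

    // Pass 1: capture the raw attribute string of each <meta ...> tag.
    preg_match_all('#<meta\b([^>]*)>#si', $html, $metaMatches, PREG_SET_ORDER);

    foreach ($metaMatches as $meta) {
        $attributes = [];

        // Pass 2: capture each name="value" pair inside that attribute string.
        preg_match_all('#([^\s=/]+)\s*=\s*"([^"]*)"#s', $meta[1], $attrMatches, PREG_SET_ORDER);

        foreach ($attrMatches as $attr) {
            // Lower-case the attribute name so NAME= and name= compare equal.
            $attributes[strtolower($attr[1])] = $attr[2];
        }

        $tags[] = $attributes;
    }

    return $tags;
}

// Sample input taken from the tags shown in the question.
$html = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />'
      . '<meta name="description" content="some description" />'
      . '<meta property="og:type" content="website" />';

print_r(extract_meta_attributes($html));
```

Each element of the returned array is one meta tag, keyed by attribute name, so the live and dev pages can be compared tag by tag without assuming which attributes are present.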