I'm trying to pull meta tags out of an HTML page so I can compare two pages (live and dev) and check that their SEO is the same after a site redesign/refactor. I need to check that the title, meta tags (description, Open Graph, etc.), h1s, our analytics (Omniture), and our ad tags (DoubleClick) are all the same.
My problem is with getting the meta tags: get_meta_tags (http://php.net/manual/en/function.get-meta-tags.php) only works if they have a name= attribute, and the same goes for "mariano at cricava dot com"'s solution.
I don't want to restrict it to certain attributes. I could assume that all our meta tags have either a name=, property=, or http-equiv= attribute and change the regex accordingly, but I can't be entirely sure: it's a massive website and any random crap could be in the tags (hence this tool, which exists to check this stuff!), so I'd like to leave it as dynamic as possible.
I have
$page = @file_get_contents('http://.../');
preg_match_all('#<meta(?:\s+?([^\=]+)\=\"(.+?)\")+?\s*?/?>#sui', $page, $matches, PREG_SET_ORDER);
but the subpatterns overwrite each other, so this only pulls out the last attribute-name=attribute-value pair of each tag:
Array
(
[0] => Array
(
[0] => <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
[1] => content
[2] => text/html; charset=UTF-8
)
[1] => Array
(
[0] => <meta name="description" content="some description" />
[1] => content
[2] => some description
)
[2] => Array
(
[0] => <meta property="og:type" content="website" />
[1] => content
[2] => website
)
...
I need all the attributes for all the meta tags. I could do this in two steps, pulling out the contents of <meta ([^>]*)> and then running a second regular expression on the results, but that seems unnecessary given the power of regex?
But back to the original question: forgetting that it's HTML for now, is there no way to have preg_match_all return every match of a recurring subpattern, rather than just the last one?
Not possible with preg_*/PCRE (nor with any other regex flavor that I know of, but in Perl you could use a (?{ push @list, $^N }) hack).
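For what it's worth, the two-step approach mentioned in the question might look roughly like the sketch below. The function name extract_meta_attributes is hypothetical, and the sketch assumes that attribute values are double-quoted and never contain a literal ">"; single-quoted or unquoted attributes would need a wider inner pattern.

```php
<?php
// Hypothetical helper (not from the original post): returns one
// associative array of attribute => value pairs per <meta> tag.
function extract_meta_attributes(string $html): array
{
    $tags = [];

    // Step 1: capture the raw attribute string of every <meta ...> tag.
    // Assumption: no attribute value contains a literal ">".
    preg_match_all('#<meta\b([^>]*)>#i', $html, $metaMatches);

    foreach ($metaMatches[1] as $attrString) {
        // Step 2: pull every name="value" pair out of that attribute string.
        // Assumption: values are always double-quoted.
        preg_match_all(
            '#([^\s=/]+)\s*=\s*"([^"]*)"#',
            $attrString,
            $pairs,
            PREG_SET_ORDER
        );

        $attrs = [];
        foreach ($pairs as $pair) {
            // Attribute names are case-insensitive in HTML, so normalise them.
            $attrs[strtolower($pair[1])] = $pair[2];
        }
        $tags[] = $attrs;
    }

    return $tags;
}

// Sample input taken from the tags shown in the question's output.
$html = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />'
      . '<meta name="description" content="some description" />'
      . '<meta property="og:type" content="website" />';

print_r(extract_meta_attributes($html));
```

Because the inner preg_match_all runs once per tag, every attribute pair is kept instead of only the last one, and the result for the live and dev pages could then be compared array by array.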