
Modified PHP get_meta_tags not working for some URLs


I am trying to use the code from the user-contributed notes on php.net for the get_meta_tags function. It seems that if a meta tag is formatted as <meta content="foo" name="bar" />, the code misses it; only tags formatted as <meta name="bar" content="foo"/> work. I'm not great with regex and tried unsuccessfully to fix it. Here is an example of a URL that seems to slip through the regex. Apologies in advance that my question isn't strictly about the get_meta_tags function, but it seems loosely related to some of the other issues people have been having with that function.

It seems like the problem is somewhere around here:

preg_match_all('/<[\s]*meta[\s]*(name|property)="?' . '([^>"]*)"?[\s]*' . 'content="?([^>"]*)"?[\s]*[\/]?[\s]*>/si', $contents, $match);

which may need to be something like:

preg_match_all('/<[\s]*meta[\s]*(name|property|content)="?' . '([^>"]*)"?[\s]*' . '(content|name)="?([^>"]*)"?[\s]*[\/]?[\s]*>/si', $contents, $match);

But again, I'm pretty terrible with regex. Any ideas?
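
The behaviour described above can be reproduced with a short script. This is only a sketch: the pattern is the first one quoted in the question, while the two sample tags and the variable names are made up for illustration.

```php
<?php
// The original pattern from the question: name/property must come before content.
$pattern = '/<[\s]*meta[\s]*(name|property)="?' . '([^>"]*)"?[\s]*'
         . 'content="?([^>"]*)"?[\s]*[\/]?[\s]*>/si';

// Hypothetical sample tags, one for each attribute order.
$nameFirst    = '<meta name="bar" content="foo"/>';
$contentFirst = '<meta content="foo" name="bar" />';

$hitsNameFirst    = preg_match_all($pattern, $nameFirst, $m1);
$hitsContentFirst = preg_match_all($pattern, $contentFirst, $m2);

var_dump($hitsNameFirst);    // int(1): name-first tag is found
var_dump($hitsContentFirst); // int(0): content-first tag is missed
```

The pattern hard-codes the attribute order, so nothing after `meta` can match when `content` comes first.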


Solution

  • One idea is to capture the meta name/property inside a lookahead, so that the match does not depend on the order of the attributes:

    function extract_meta_tags($source)
    {
      $pattern = '
      ~<\s*meta\s
    
      # using lookahead to capture type to $1
        (?=[^>]*?
        \b(?:name|property|itemprop|http-equiv)\s*=\s*
        (?|"\s*([^"]*?)\s*"|\'\s*([^\']*?)\s*\'|
        ([^"\'>]*?)(?=\s*/?\s*>|\s\w+\s*=))
      )
    
      # capture content to $2
      [^>]*?\bcontent\s*=\s*
        (?|"\s*([^"]*?)\s*"|\'\s*([^\']*?)\s*\'|
        ([^"\'>]*?)(?=\s*/?\s*>|\s\w+\s*=))
      [^>]*>
    
      ~ix';
    
      if(preg_match_all($pattern, $source, $out))
        return array_combine(array_map('strtolower', $out[1]), $out[2]);
      return array();
    }
    

    See the test at regex101. The pattern uses the branch reset feature, (?|...), so that values in double quotes, single quotes, or no quotes all end up in the same capture group.
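
For readers who have not met branch reset groups, here is a minimal sketch of the difference they make. The `v=` attribute and the variable names are invented for the example and are not part of the answer's pattern.

```php
<?php
// Without branch reset, each alternative gets its own group number,
// so the value lands in group 1 or group 2 depending on the quote style.
preg_match('/v=(?:"([^"]*)"|\'([^\']*)\')/', "v='x'", $plain);
// $plain[1] is '' (double-quote branch did not match), $plain[2] is 'x'

// With branch reset (?|...), both alternatives share group 1.
preg_match('/v=(?|"([^"]*)"|\'([^\']*)\')/', "v='x'", $reset);
// $reset[1] is 'x', and there is no group 2

print_r($plain);
print_r($reset);
```

This is why the function can read the type from `$out[1]` and the content from `$out[2]` without checking which quote style matched.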

    To print the result: print_r(extract_meta_tags($str)); Try it with some different data at eval.in


    Use this on the HTML <head> section. To get the page source and extract the head:

    1.) Get the source using cURL, file_get_contents or fsockopen.

    2.) Extract <head> using DOM or a regex like this: (?is)<head\b[^>]*>(.*?)</head>

    3.) Extract the meta tags from <head> using the provided regex, or try a parser.
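
The three steps above could be sketched roughly as follows, taking the parser route for step 3. The helper names extract_head and meta_tags_from_html are hypothetical, the sample HTML is made up, and the sketch assumes PHP's DOM extension is available. Step 1 is replaced by a literal string so the example runs without network access.

```php
<?php
// Hypothetical helper for step 2: returns the inner HTML of <head>,
// or the whole input when no <head> element is found.
function extract_head(string $html): string
{
    if (preg_match('~<head\b[^>]*>(.*?)</head>~is', $html, $m)) {
        return $m[1];
    }
    return $html;
}

// Hypothetical helper for step 3: reads meta tags with DOMDocument,
// so attribute order and quote style do not matter.
function meta_tags_from_html(string $html): array
{
    $doc = new DOMDocument();
    // Real-world markup is often invalid; collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<html><head>' . extract_head($html) . '</head></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $tags = [];
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (!$meta->hasAttribute('content')) {
            continue;
        }
        // Same attribute names the regex in the answer looks for.
        foreach (['name', 'property', 'itemprop', 'http-equiv'] as $attr) {
            if ($meta->hasAttribute($attr)) {
                $key = strtolower($meta->getAttribute($attr));
                $tags[$key] = $meta->getAttribute('content');
                break;
            }
        }
    }
    return $tags;
}

// Step 1 would be something like $html = file_get_contents($url);
// a literal string stands in for the fetched page here.
$html = '<html><head><title>t</title>'
      . '<meta content="foo" name="bar" />'
      . "<meta property='og:title' content='Hello'>"
      . '</head><body></body></html>';

print_r(meta_tags_from_html($html));
```

Note that loadHTML does not assume UTF-8 unless the markup declares it, so pages with non-ASCII content may need extra care with encoding.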