php, regex, replace, sanitization

Change from preg_match() to preg_replace() to remove matched <head> content


I know that using regular expressions on HTML is not preferred, but I am still confused as to why this doesn't work:

I'm trying to remove the "head" from a document.
Here's the doc:

<html>
 <head>
   <!--
     a comment within the head
     -->
 </head>
 <body>
stuff in the body
 </body>
</html>

My code:

$matches = array();
$result = preg_match('/(?:<head[^>]*>)(.*?)(<\/head>)/is', $contents, $matches);
var_dump($matches);

This does not actually work. Here's the output I see:

array(3) { [0]=> string(60) " " [1]=> string(47) " " [2]=> string(7) "" }

However, if I adjust the HTML doc to not have the comment


Solution

  • Your script is working fine. The output only looks empty because the browser renders the HTML inside the dump as markup; the string lengths in your var_dump output (60, 47 and 7) show that the matches are not empty. Try:

    $result = preg_match ('/(?:<head[^>]*>)(.*?)(<\/head>)/is', $contents, $matches); 
    ob_start(); // Capture the result of var_dump
    var_dump ($matches);
    echo htmlentities(ob_get_clean()); // Escape HTML in the dump
    

    Also, as has been said, to actually remove the head you need to use preg_replace() and replace the match with an empty string ('').
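
A minimal sketch of that last step, assuming $contents holds the page source as in the question. The name $withoutHead is introduced here only for illustration, and the \b after "head" is a small addition (not in the original pattern) so that a tag such as <header> is not matched by mistake:

```php
<?php
// Sample document from the question; in real use $contents would be the page source.
$contents = <<<HTML
<html>
 <head>
   <!--
     a comment within the head
     -->
 </head>
 <body>
stuff in the body
 </body>
</html>
HTML;

// Same idea as the preg_match() pattern, without capture groups (a replacement does not need them).
// i = case-insensitive; s = let "." match newlines so a multi-line head is matched.
$withoutHead = preg_replace('/<head\b[^>]*>.*?<\/head>/is', '', $contents);

// preg_replace() returns null on error (for example, when the backtrack limit is hit).
if ($withoutHead === null) {
    throw new RuntimeException('preg_replace failed with error code ' . preg_last_error());
}

// Escape the result so the remaining tags are visible when viewed in a browser.
echo htmlentities($withoutHead);
```

As the question itself notes, regular expressions are fragile on HTML in general; this only covers the simple case of a single, well-formed head element.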