php regex preg-match

Regex to replace webpage meta description apostrophe using preg_match


I have this data:

<meta name="description" content="Access Kenya is Kenya's leading corporate Internet service provider and is a technology solutions provider in Kenya with IT and network solutions for your business.Welcome to the Yellow Network.Kenya's leading Corporate and Residential ISP" />;

I am using this Regular Expression:

<meta +name *=[\"']?description[\"']? *content=[\"']?([^<>'\"]+)[\"']?

to get the webpage description. Everything works fine, but it breaks wherever the content contains an apostrophe.

How do I escape that?


Solution

  • Your regular expression considers these three forms of a <meta> node:

    <meta name="description" content="Some Content" />
    <meta name='description' content='Some Content' />
    <meta name=description content=Some Content />
    

    The third form is not valid HTML (an unquoted attribute value cannot contain spaces), but anything can turn up in real pages, so you are right to allow for it.
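To see why the apostrophe causes trouble, it may help to run the original pattern on a shortened version of the question's data. The character class `[^<>'\"]+` excludes quotes, so the capture stops at the first apostrophe. This is only a small sketch; the sample string and the variable names are made up for illustration.

```php
<?php
// Shortened sample modeled on the question's data (hypothetical string).
$html = '<meta name="description" content="Access Kenya is Kenya\'s leading ISP" />';

// The original pattern from the question, wrapped in ~ delimiters.
// Inside a double-quoted PHP string, \" yields a literal double quote.
$original = "~<meta +name *=[\"']?description[\"']? *content=[\"']?([^<>'\"]+)[\"']?~";

preg_match($original, $html, $matches);
$captured = $matches[1];

// The capture ends right before the apostrophe in "Kenya's".
echo $captured, PHP_EOL; // Access Kenya is Kenya
```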

    The simplest fix is to extend your original regular expression up to the closing tag and use the non-greedy ? quantifier:

    <meta +name *=[\"']?description[\"']? *content=[\"']?(.*?)[\"']? */?>
                                                          └─┘       └───┘
              search zero-or-more characters except following       closing tag characters
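As a rough illustration, the modified pattern can be used with preg_match like this. The `~` delimiters, the `i` flag, and the shortened sample string are assumptions added for the example, not part of the original answer.

```php
<?php
// Shortened sample modeled on the question's data (hypothetical string).
$html = '<meta name="description" content="Access Kenya is Kenya\'s leading corporate ISP" />';

// The modified pattern: a lazy (.*?) capture followed by the closing-tag characters.
// Inside a double-quoted PHP string, \" yields a literal double quote.
$pattern = "~<meta +name *=[\"']?description[\"']? *content=[\"']?(.*?)[\"']? */?>~i";

$description = null;
if (preg_match($pattern, $html, $matches) === 1) {
    $description = $matches[1]; // first capture group = the content value
}

echo $description, PHP_EOL; // Access Kenya is Kenya's leading corporate ISP
```

Because the capture is lazy and the pattern has to reach the closing `>`, the apostrophe inside the text no longer ends the match.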
    


    But even in this case, what happens if you have this meta tag?

    <meta content="Some Content" name="description" />
    

    Your regular expression will fail, because it expects the name attribute to come before content.
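A quick check of both attribute orders shows the limitation. This is a sketch only; the variable names are hypothetical, and the pattern is the modified one from above with `~` delimiters added.

```php
<?php
$pattern = "~<meta +name *=[\"']?description[\"']? *content=[\"']?(.*?)[\"']? */?>~i";

$nameFirst    = '<meta name="description" content="Some Content" />';
$contentFirst = '<meta content="Some Content" name="description" />';

// preg_match returns 1 on a match and 0 when nothing matches.
$matchedNameFirst    = preg_match($pattern, $nameFirst);
$matchedContentFirst = preg_match($pattern, $contentFirst);

var_dump($matchedNameFirst);    // int(1)
var_dump($matchedContentFirst); // int(0)
```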

    To really match an HTML node, you must use a parser:

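    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // suppress warnings from malformed HTML
    $dom->loadHTML( $yourHtmlString );
    $xpath = new DOMXPath( $dom );
    
    // Select the content attribute of the meta tag whose name is "description"
    $description = $xpath->query( '//meta[@name="description"]/@content' );
    echo $description->item(0)->nodeValue;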
    

    will output:

    Some Content
    

    Yes, it's 5 lines vs. 1, but with this method you will match any <meta name="description">, whatever the attribute order, even if it contains a third, invalid attribute.
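If the lookup is needed in several places, the parser approach could be wrapped in a small function. The sketch below is one possible shape, not part of the original answer: the name getMetaDescription and the choice to return null when no description is found are assumptions.

```php
<?php
// Hypothetical helper: returns the meta description, or null if there is none.
function getMetaDescription(string $html): ?string
{
    if ($html === '') {
        return null; // DOMDocument::loadHTML() rejects an empty string
    }

    // Collect libxml warnings internally instead of emitting them for sloppy markup.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//meta[@name="description"]/@content');

    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $nodes->item(0)->nodeValue;
}

// Apostrophes and attribute order make no difference to the parser.
$a = getMetaDescription('<meta name="description" content="Kenya\'s leading ISP" />');
$b = getMetaDescription('<meta content="Some Content" name="description" />');
$c = getMetaDescription('<meta name="keywords" content="isp, kenya" />');

var_dump($a, $b, $c);
```

Note that the XPath comparison is case-sensitive, so a tag written as name="Description" would not be found by this query as written.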