I have this data:
<meta name="description" content="Access Kenya is Kenya's leading corporate Internet service provider and is a technology solutions provider in Kenya with IT and network solutions for your business.Welcome to the Yellow Network.Kenya's leading Corporate and Residential ISP" />;
I am using this Regular Expression:
<meta +name *=[\"']?description[\"']? *content=[\"']?([^<>'\"]+)[\"']?
to get the webpage description. It works fine, but the captured text stops wherever there is an apostrophe.
How do I escape that?
Your regular expression allows for these three forms of a <meta>
node:
<meta name="description" content="Some Content" />
<meta name='description' content='Some Content' />
<meta name=description content=Some Content />
The third form is not valid HTML, but anything can turn up in real pages, so you are right to allow for it.
The simple fix is to extend your regular expression so that it also matches the end of the tag, and to capture the content with the non-greedy ? modifier:
<meta +name *=[\"']?description[\"']? *content=[\"']?(.*?)[\"']? */?>
Here (.*?) matches zero or more characters, as few as possible, and the trailing [\"']? */?> matches the characters that close the tag.
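As a quick check, the sketch below runs the original and the modified pattern against a shortened copy of the tag from the question. The ~ delimiters and the variable names are my own additions for illustration; the patterns themselves are otherwise unchanged.

```php
<?php
// Shortened version of the tag from the question (apostrophe inside the content value).
$html = '<meta name="description" content="Access Kenya is Kenya\'s leading corporate Internet service provider" />';

// Original pattern: the character class excludes quotes, so the capture ends at the apostrophe.
$oldPattern = '~<meta +name *=["\']?description["\']? *content=["\']?([^<>\'"]+)["\']?~';
// Modified pattern: lazy capture, anchored by the characters that close the tag.
$newPattern = '~<meta +name *=["\']?description["\']? *content=["\']?(.*?)["\']? */?>~';

preg_match($oldPattern, $html, $old);
preg_match($newPattern, $html, $new);

echo $old[1], "\n"; // Access Kenya is Kenya
echo $new[1], "\n"; // Access Kenya is Kenya's leading corporate Internet service provider
```

Note that the lazy capture still ends at the first > it can reach, so a description that itself contains a > character would be cut short.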
But even in this case, what happens if you have this meta?
<meta content="Some Content" name="description" />
Your regular expression will fail.
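A minimal sketch of that failure, using the modified pattern with ~ delimiters added (the variable names are only illustrative):

```php
<?php
$pattern = '~<meta +name *=["\']?description["\']? *content=["\']?(.*?)["\']? */?>~';

// Same tag, but with content placed before name.
$reordered = '<meta content="Some Content" name="description" />';

// preg_match() returns 0 when the pattern does not match.
$found = preg_match($pattern, $reordered, $m);
var_dump($found); // int(0)
```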
To match an HTML node reliably, you should use a parser:
$dom = new DOMDocument();
libxml_use_internal_errors(true);   // keep warnings about malformed HTML out of the output
$dom->loadHTML($yourHtmlString);
$xpath = new DOMXPath($dom);
// Select the content attribute of every <meta name="description"> node
$description = $xpath->query('//meta[@name="description"]/@content');
echo $description->item(0)->nodeValue;
will output:
Some Content
Yes, it takes a few lines instead of one, but with this method you will match any <meta name="description">, whatever the order of its attributes
(even if it contains a third, invalid attribute).
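If you need this in more than one place, the parser approach can be wrapped in a small helper. This is only a sketch: the function name metaDescription and the choice to return null when no matching tag exists are my assumptions, not part of the code above.

```php
<?php
// Hypothetical helper: returns the description, or null if the page has none.
function metaDescription(string $html): ?string
{
    // Collect libxml warnings internally, remembering the previous setting.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//meta[@name="description"]/@content');

    // Discard collected warnings and restore the caller's setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // item(0) would be null on an empty result, so check the length first.
    return $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;
}

$a = metaDescription('<html><head><meta content="Some Content" name="description" /></head><body></body></html>');
$b = metaDescription('<html><head><meta name=\'description\' content="Kenya\'s leading ISP"></head><body></body></html>');
$c = metaDescription('<html><head><title>No description here</title></head><body></body></html>');

var_dump($a, $b, $c);
```

Checking the result length before calling item(0) avoids an error on pages that have no description tag at all, which the one-line echo does not guard against.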