I'm putting together a quick script to scrape a page for some results and I'm having trouble figuring out how to ignore white space and new lines in my regex.
For example, here's how the page may present a result in HTML:
<td class="things">
<div class="stuff">
<p>I need to capture this text.</p>
</div>
</td>
How would I change the following regex to ignore the spaces and new lines:
$regex = '/<td class="things"><div class="stuff"><p>(.*)<\/p><\/div><\/td>/i';
Any help would be appreciated. Help that also explains why you did something would be greatly appreciated!
I should caution you that you're playing with fire by trying to parse HTML with regex. Anyway, to answer your question, you can use this regex:
$regex = '#<td class="things">\s*<div class="stuff">\s*<p>(.*?)</p>\s*</div>\s*</td>#si';
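As for the why: `\s*` between the tags matches zero or more whitespace characters, newlines included, so the pattern no longer cares how the markup is indented or wrapped. The `s` modifier lets `.` match newlines too, in case the paragraph text spans lines, and `i` keeps the match case-insensitive. Using `#` as the delimiter means the `/` in the closing tags needs no escaping. Here is a minimal sketch of how it could be applied to a whole page; the helper name `capture_things` and the second sample row are my own additions for illustration, and the pattern is left unanchored with a lazy `(.*?)` so that each result is captured separately:

```php
<?php
// Hypothetical helper (not from the question): returns the text of every
// matching <p>, tolerating any whitespace or newlines between the tags.
function capture_things(string $html): array
{
    $regex = '#<td class="things">\s*<div class="stuff">\s*<p>(.*?)</p>\s*</div>\s*</td>#si';
    preg_match_all($regex, $html, $matches);
    return $matches[1]; // contents of capture group 1 for each match
}

$html = <<<EOF
<td class="things">
<div class="stuff">
<p>I need to capture this text.</p>
</div>
</td>
<td class="things"><div class="stuff"><p>Second result.</p></div></td>
EOF;

foreach (capture_things($html) as $text) {
    echo $text, "\n";
}
```

With a greedy `(.*)` and the `s` modifier, the capture would instead run from the first `<p>` to the last `</p>` on the page, which is why the lazy form is used here.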
Update: Here is DOM-parser-based code to get what you want:
$html = <<< EOF
<td class="things">
<div class="stuff">
<p>I need to capture this text.</p>
</div>
</td>
EOF;
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//td[@class='things']/div[@class='stuff']/p");
foreach ($nodelist as $node) {
    $val = $node->nodeValue;
    echo "$val\n"; // prints: I need to capture this text.
}
And now please refrain from parsing HTML using regex in your code.