
How to regex scrape HTML and ignore whitespace and newlines in code?


I'm putting together a quick script to scrape a page for some results and I'm having trouble figuring out how to ignore white space and new lines in my regex.

For example, here's how the page may present a result in HTML:

<td class="things">
    <div class="stuff">
        <p>I need to capture this text.</p>
    </div>
</td>

How would I change the following regex to ignore the spaces and new lines:

$regex = '/<td class="things"><div class="stuff"><p>(.*)<\/p><\/div><\/td>/i';

Any help would be appreciated. Help that also explains why you did something would be greatly appreciated!


Solution

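  • A word of caution first: parsing HTML with regex is playing with fire, because the pattern breaks as soon as the markup changes even slightly. That said, to answer your question, put \s* (zero or more whitespace characters, newlines included) between the tags:

    $regex = '#<td class="things">\s*<div class="stuff">\s*<p>(.*?)</p>\s*</div>\s*</td>#si';

    The # delimiters mean the slashes in the closing tags need no escaping. The s modifier lets . match newlines too, so the captured text may span several lines. The lazy (.*?) stops at the first </p> rather than running on to the last one on the page.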
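
    As a rough sketch of how such a pattern might be used on a page with several results: the sample $html below and its second row are made up for illustration, and the pattern here is deliberately unanchored with a lazy capture so that each result is matched separately.

    ```php
    <?php
    // Made-up sample fragment; in a real script $html would hold the fetched page.
    $html = '
    <td class="things">
        <div class="stuff">
            <p>I need to capture this text.</p>
        </div>
    </td>
    <td class="things"><div class="stuff"><p>Second result.</p></div></td>
    ';

    // \s*   : optional whitespace (spaces, tabs, newlines) between tags
    // (.*?) : lazy capture, stops at the first </p> it can
    // s     : lets . match newlines; i : case-insensitive
    // #...# : delimiters, so "/" in closing tags needs no escaping
    $regex = '#<td class="things">\s*<div class="stuff">\s*<p>(.*?)</p>\s*</div>\s*</td>#si';

    // Collect the captured group of every match into $matches[1].
    preg_match_all($regex, $html, $matches);
    print_r($matches[1]);
    ```

    Both the indented and the single-line row should be captured, since \s* also matches the empty string.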
    

    Update: here is DOM-parser-based code that gets what you want without caring about whitespace at all:

    $html = <<< EOF
    <td class="things">
        <div class="stuff">
            <p>I need to capture this text.</p>
        </div>
    </td>
    EOF;
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $doc->loadHTML($html); // parse the HTML into a DOM tree
    $xpath = new DOMXPath($doc);
    // select every <p> directly inside <div class="stuff"> directly inside <td class="things">
    $nodelist = $xpath->query("//td[@class='things']/div[@class='stuff']/p");
    foreach ($nodelist as $node) {
        echo $node->nodeValue, "\n"; // prints: I need to capture this text.
    }
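
    To illustrate why the DOM approach sidesteps the whitespace question, here is a small sketch that runs the same XPath query on an indented and a single-line version of the markup. The helper name extractThings is hypothetical, and the <table><tr> wrapper is an assumption added only so the fragment is valid table markup.

    ```php
    <?php
    // Hypothetical helper (not from the original answer): returns the trimmed
    // text of every <p> matched by the XPath query used above.
    function extractThings(string $html): array
    {
        $doc = new DOMDocument();
        $previous = libxml_use_internal_errors(true); // silence parser warnings
        $doc->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);        // restore the old setting

        $xpath = new DOMXPath($doc);
        $texts = [];
        foreach ($xpath->query("//td[@class='things']/div[@class='stuff']/p") as $node) {
            $texts[] = trim($node->nodeValue);
        }
        return $texts;
    }

    $indented = '<table><tr>
    <td class="things">
        <div class="stuff">
            <p>I need to capture this text.</p>
        </div>
    </td>
    </tr></table>';

    $minified = '<table><tr><td class="things"><div class="stuff">'
              . '<p>I need to capture this text.</p></div></td></tr></table>';

    // The parser builds the same element structure for both inputs,
    // so the query returns the same list either way.
    var_dump(extractThings($indented) === extractThings($minified));
    ```

    Whitespace between tags becomes separate text nodes in the tree, which the element steps of the XPath expression simply skip.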
    

    And now please refrain from parsing HTML using regex in your code.