php simple-html-dom

How to get <a> tags in <body> but exclude header and footer sections


If I have a webpage like this:

<body>
  <header>
    <a href='http://domain1.com'>link 1 text</a>
  </header>

  <a href='http://domain2.com'>link 2 text</a>

  <footer>
    <a href='http://domain3.com'>link 3 text</a>
  </footer>
</body>

How do I pull the <a> tags out of the <body> but exclude the links from <header> and <footer>?

In the real web page, there will be a lot of <a> tags in the <header> so I'd rather not have to cycle through ALL of them.

I want to pull out the URLs and anchor text from each of the <a> tags that are NOT inside the <header> or <footer> tags.

EDIT: this is how I find links in the header:

$header = $html->find('header', 0);
foreach ($header->find('a') as $a) {
  // do something
}

I would like to do this (note the use of "!")

$foo = $html->find('!header,!footer');
foreach ($foo->find('a') as $a){
  // do something
}

Solution

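  • Remove the header and footer from the DOM you are working with before looking for the links. Blanking an element's outertext only changes the markup that gets written out, so reload the document from the saved markup before calling find() again.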

    <?php
        include("simple_html_dom.php");
        $source = <<<EOD
        <body>
            <header>
                <a href='http://domain1.com'>link 1 text</a>
            </header>
    
            <a href='http://domain2.com'>link 2 text</a>
    
            <a href='http://domain4.com'>link 4 text</a>
    
            <footer>
                <a href='http://domain3.com'>link 3 text</a>
            </footer>
        </body>
    EOD;
    
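        $html = str_get_html($source);
        // Blank out the sections whose links should be ignored.
        foreach ($html->find('header, footer') as $unwanted) {
            $unwanted->outertext = "";
        }
        // Re-parse the modified markup so the blanked nodes are gone from the DOM.
        $html->load($html->save());
        $links = $html->find("a");
        foreach ($links as $link) {
            // Prints the whole tag; use $link->href and $link->plaintext
            // to get just the URL and the anchor text.
            print $link;
        }
    
    ?>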
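
If re-parsing the document is undesirable, the same filtering can probably be done in a single query with PHP's bundled DOM extension instead of simple_html_dom. The following is only a sketch under that assumption: it swaps in DOMDocument and DOMXPath (not the library used in the question), reuses the sample markup from the question, and the names $query and $results are made up for illustration.

```php
<?php
// Sketch: uses PHP's built-in DOM extension, NOT simple_html_dom.
$source = <<<'EOD'
<body>
  <header>
    <a href='http://domain1.com'>link 1 text</a>
  </header>

  <a href='http://domain2.com'>link 2 text</a>

  <footer>
    <a href='http://domain3.com'>link 3 text</a>
  </footer>
</body>
EOD;

// libxml may emit warnings for HTML5 tags such as <header>; collect them quietly.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($source);
libxml_clear_errors();

// Select every <a> under <body> that has no <header> or <footer> ancestor.
$xpath = new DOMXPath($doc);
$query = '//body//a[not(ancestor::header) and not(ancestor::footer)]';

$results = [];
foreach ($xpath->query($query) as $a) {
    $results[] = [
        'href' => $a->getAttribute('href'),   // the URL
        'text' => trim($a->textContent),      // the anchor text
    ];
}

foreach ($results as $r) {
    echo $r['href'], ' => ', $r['text'], PHP_EOL;
}
```

With the sample markup, only the domain2.com link should be listed. The header and footer links are filtered out by the XPath predicate itself, so the PHP loop never has to visit them.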