
Simple HTML Dom: Find <p> after removing <table>


I'm writing a basic Wikipedia page crawler that picks up the first link in the first paragraph of an article. My current strategy is to find the first paragraph, then find the first link in that paragraph (checking for exceptions). However, in some Wikipedia articles the first <p> tag sits inside a table, which I don't want. So I'm trying to remove all tables from the page before finding the paragraph.
But after I remove the tables, my find() call for the first paragraph still returns the paragraph inside a table I thought I had removed from the HTML. Any ideas?

    $html = new simple_html_dom();
    $html->load_file($new_target);

    if (!empty($html->find('table'))) {
        foreach($html->find('table') as $table) {
            $table->innertext = '';
            $table->outertext = '';
        }
    }

    $p = $html->find('p', 0);
    // this returns a paragraph that is inside a table I just deleted.

Solution

  • You can do this with the standard DOMDocument object, like this:

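    $dom = new DOMDocument();
    // Wikipedia pages are HTML, not XML, so use loadHTMLFile() rather than load().
    libxml_use_internal_errors(true); // keep parser warnings about messy markup quiet
    $dom->loadHTMLFile($yourHtmlFile);
    // getElementsByTagName() returns a live list; copy it before removing nodes,
    // otherwise the list shrinks during iteration and some tables are skipped.
    foreach (iterator_to_array($dom->getElementsByTagName('table')) as $table) {
        $table->parentNode->removeChild($table);
    }
    foreach ($dom->getElementsByTagName('p') as $para) {
        $paraHtml = $dom->saveHTML($para);
        echo $paraHtml;
        break; // do not process other p-tags.
    }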
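
    As a possible alternative, the tables can be left in place and the paragraph selected with an XPath query that skips anything nested inside a table. The sketch below is only an illustration: the helper name firstParagraphLink() and the sample markup are made up for this example and are not part of the original question or answer. It uses DOMXPath to take the first <p> with no <table> ancestor, then the first <a href> inside it, which is the link the crawler is after.

```php
<?php
// Hypothetical helper (not from the original answer): returns the href of the
// first link in the first <p> that is not nested inside a <table>, or null.
function firstParagraphLink(string $html): ?string
{
    $dom = new DOMDocument();
    // Real-world HTML is rarely well-formed; collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // First <p> in document order that has no <table> ancestor.
    $paragraphs = $xpath->query('(//p[not(ancestor::table)])[1]');
    if ($paragraphs === false || $paragraphs->length === 0) {
        return null;
    }

    // First <a> carrying an href inside that paragraph.
    $links = $xpath->query('.//a[@href]', $paragraphs->item(0));
    if ($links === false || $links->length === 0) {
        return null;
    }
    return $links->item(0)->getAttribute('href');
}

// Made-up sample: the first <p> in the markup sits inside a table.
$sample = '<html><body>'
    . '<table><tr><td><p><a href="/wiki/Infobox">in table</a></p></td></tr></table>'
    . '<p>Intro with <a href="/wiki/First">a link</a>.</p>'
    . '</body></html>';

echo firstParagraphLink($sample), PHP_EOL;
```

    Because nothing is removed from the document, this avoids the live-list pitfall entirely, and the same query can be tightened later (for example, to skip links inside parentheses) without touching the parsing code.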