php domdocument domxpath

$domxpath->query - Table Contents


About two days ago I was advised to use DOMDocument instead of regex.

I still do not know how to use the query correctly.

The page linked below has a section called "TERRITÓRIO E AMBIENTE". I would like to get the contents of the 4 rows below it.

https://cidades.ibge.gov.br/brasil/sp/sao-paulo/panorama

$html = file_get_contents('https://cidades.ibge.gov.br/brasil/sp/sao-paulo/panorama');
$document = new DOMDocument();
$document->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
$domxpath = new DOMXPath($document);
$paragraphs = $domxpath->query('
    //th[*[
            contains(text(), "TERRITÓRIO E AMBIENTE")
          ]
        ]
    /following-sibling::tr[
            position() = 12
        ]'
);

I used 12 as the <tr> position because that is how many appear in the page source, but I do not know whether I am writing this query correctly. These errors are shown:

Warning: DOMDocument::loadHTML(): Tag app invalid in Entity, line: 25 
Warning: DOMDocument::loadHTML(): Misplaced DOCTYPE declaration in Entity, line: 25
Warning: DOMDocument::loadHTML(): htmlParseStartTag: misplaced <html> tag in Entity, line: 25 

thanks


Solution

  • There are several issues in your code.

    • The HTML you get from that website is invalid, so you need to ignore errors (this is generally not recommended but in this case I think it's OK).

    @$document->loadHTML($html);
    
    • The text you're looking for is not in uppercase in the source (it only displays in uppercase because of its CSS styling), so you need to either normalize the case in the query or search for the text exactly as it is written in the markup.
    • Your approach (getting the 12th sibling) is too brittle. I inspected the markup a little and it's hard to make it less brittle, but I think this comes close:

    //th[contains(text(), "Território e Ambiente")]/parent::tr/following-sibling::tr[1]/td[3]
    

    That selects the th element containing the text Território e Ambiente, goes up to its parent tr, moves to the next tr sibling, and finally takes that row's third td element (where the value is). This is still brittle, so keep an eye on changes to the website, although its markup seems unlikely to change.
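
    The first two points above can be sketched on a small inline sample. This is only an illustration under assumptions: the markup below is a hypothetical, heavily simplified stand-in for the real page, libxml_use_internal_errors() is used as an alternative to the @ operator, and translate() is used to lowercase the heading because XPath 1.0 has no lower-case() function. The query also tests the th element's whole string value (.) rather than text().

```php
<?php
// Hypothetical, simplified stand-in for the real page (assumption, not the actual markup).
$html = <<<'HTML'
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head><body>
<app>
<table>
  <tr><th>Território e Ambiente</th></tr>
  <tr><td></td><td>Área da unidade territorial</td><td>1.521,110 km²</td></tr>
</table>
</app>
</body></html>
HTML;

// Collect libxml's parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);
$document = new DOMDocument();
$document->loadHTML($html);
$parseErrors = libxml_get_errors(); // kept so they can be logged if needed
libxml_clear_errors();
libxml_use_internal_errors($previous);

$domxpath = new DOMXPath($document);

// translate() replaces each character found in its second argument with the
// character at the same position in its third argument.
$upper = 'ABCDEFGHIJKLMNOPQRSTUVWXYZÁÉÍÓÚÂÊÔÃÕÇ';
$lower = 'abcdefghijklmnopqrstuvwxyzáéíóúâêôãõç';
$query = sprintf(
    '//th[contains(translate(., "%s", "%s"), "território e ambiente")]'
    . '/parent::tr/following-sibling::tr[1]/td[3]',
    $upper,
    $lower
);

$cells = $domxpath->query($query);
$value = $cells->length > 0 ? trim($cells->item(0)->nodeValue) : null;
echo $value, PHP_EOL;
```

    With the case folded inside the query, the heading matches however it is capitalized in the source, as long as the accented letters it uses are listed in both translate() strings.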

    So now you need to repeat that XPath query 3 more times, adding two to the tr sibling index each time, because there is an empty row between each pair of value rows. It ends up looking something like this:

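    // Fetch the page; the @ below suppresses the warnings caused by its invalid markup.
    $html = file_get_contents('https://cidades.ibge.gov.br/brasil/sp/sao-paulo/panorama');
    $document = new DOMDocument();
    @$document->loadHTML($html);
    $domxpath = new DOMXPath($document);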
    $paragraphs = $domxpath->query('//th[contains(text(), "Território e Ambiente")]/parent::tr/following-sibling::tr[1]/td[3]');
    echo "First: ".preg_replace('/\s+/', ' ', $paragraphs[0]->nodeValue);
    echo "<br>";
    $paragraphs = $domxpath->query('//th[contains(text(), "Território e Ambiente")]/parent::tr/following-sibling::tr[3]/td[3]');
    echo "Second: ".preg_replace('/\s+/', ' ', $paragraphs[0]->nodeValue);
    echo "<br>";
    $paragraphs = $domxpath->query('//th[contains(text(), "Território e Ambiente")]/parent::tr/following-sibling::tr[5]/td[3]');
    echo "Third: ".preg_replace('/\s+/', ' ', $paragraphs[0]->nodeValue);
    echo "<br>";
    $paragraphs = $domxpath->query('//th[contains(text(), "Território e Ambiente")]/parent::tr/following-sibling::tr[7]/td[3]');
    echo "Fourth: ".preg_replace('/\s+/', ' ', $paragraphs[0]->nodeValue);
    

    First: 1.521,110 km²
    Second: 92,6 %
    Third: 74,8 %
    Fourth: 50,3 %

    Note the use of preg_replace() to get rid of the abundant whitespace.
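
    As a small illustration of that cleanup step, here is a sketch of a helper that collapses the whitespace and also trims both ends. The name cleanText and the sample string are hypothetical and are not part of the code above.

```php
<?php
// Hypothetical helper name (assumption); wraps the preg_replace() call used above.
function cleanText(string $raw): string
{
    // Collapse every run of whitespace (spaces, tabs, newlines) into one space,
    // then strip the single space left at either end.
    return trim(preg_replace('/\s+/', ' ', $raw));
}

// Sample string imitating the padded text found inside a scraped table cell.
$raw = "\n            1.521,110\n            km²\n        ";
echo cleanText($raw), PHP_EOL;
```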

    And using a little more XPath magic we can get it to work with only one query:

    //th[contains(text(), "Território e Ambiente")]/parent::tr/following-sibling::tr[position() mod 2 = 1]/td[3]
    

    It works the same as the others, but instead of selecting one specific tr sibling, it selects every other one.

    $paragraphs = $domxpath->query('//th[contains(text(), "Território e Ambiente")]/parent::tr/following-sibling::tr[position() mod 2 = 1]/td[3]');
    foreach ($paragraphs as $i => $p) {
        echo ($i + 1)." value: ".preg_replace('/\s+/', ' ', $p->nodeValue);
        echo "<br>";
    }
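
    The single-query version can be tried offline with a sketch like the one below. The markup is a hypothetical simplification (four value rows separated by empty rows, then one unrelated row), and the extra position() <= 7 condition is an assumption added here so that rows after the fourth value are ignored. Whether the real table needs that bound depends on its actual layout.

```php
<?php
// Hypothetical, simplified markup (assumption): value rows alternate with
// empty rows, and one unrelated row follows the section.
$rows = '';
foreach (['1.521,110 km²', '92,6 %', '74,8 %', '50,3 %'] as $v) {
    $rows .= "<tr><td></td><td>label</td><td> $v </td></tr><tr><td></td></tr>";
}
$html = '<html><head>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
      . '</head><body><table>'
      . '<tr><th>Território e Ambiente</th></tr>'
      . $rows
      . '<tr><td></td><td>other</td><td>unrelated</td></tr>'
      . '</table></body></html>';

$document = new DOMDocument();
$document->loadHTML($html);
$domxpath = new DOMXPath($document);

// Odd-numbered following siblings hold the values; stop after the 7th sibling,
// which is the fourth value row in this sample.
$cells = $domxpath->query(
    '//th[contains(text(), "Território e Ambiente")]/parent::tr'
    . '/following-sibling::tr[position() mod 2 = 1 and position() <= 7]/td[3]'
);

$values = [];
foreach ($cells as $cell) {
    // Same whitespace cleanup as above, plus trimming both ends.
    $values[] = trim(preg_replace('/\s+/', ' ', $cell->nodeValue));
}
print_r($values);
```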