About two days ago I was advised to use DOMDocument instead of regex,
but I still do not know how to write the query correctly.
The page linked below has a section "TERRITÓRIO E AMBIENTE"; I would like to get the contents of the 4 rows below it.
https://cidades.ibge.gov.br/brasil/sp/sao-paulo/panorama
$html = file_get_contents( 'https://cidades.ibge.gov.br/brasil/sp/sao-paulo/panorama' );
$document = new DOMDocument();
$document->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
$domxpath = new DOMXPath($document);
$paragraphs = $domxpath->query('
//th[*[
contains(text(), "TERRITÓRIO E AMBIENTE")
]
]
/following-sibling::tr[
position() = 12
]'
);
I used position 12 for the <tr> because that is what appears in the source code, but I do not know whether the query is right. I am also getting these warnings:
Warning: DOMDocument::loadHTML(): Tag app invalid in Entity, line: 25
Warning: DOMDocument::loadHTML(): Misplaced DOCTYPE declaration in Entity, line: 25
Warning: DOMDocument::loadHTML(): htmlParseStartTag: misplaced <html> tag in Entity, line: 25
thanks
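There are several issues in your code.
The warnings appear because the page contains markup that libxml does not recognize (for example the <app> tag). The document is still parsed, so you can suppress them with the @ operator:
@$document->loadHTML($html);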
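If you would rather not silence everything with @, a possible alternative is to let libxml collect the parser errors internally. This is only a sketch: the sample markup is hypothetical, and which messages libxml reports (if any) depends on the libxml version installed.

```php
<?php
// Hypothetical sample markup: <app> is not a standard HTML tag,
// similar to what the warnings in the question complain about.
$html = '<!DOCTYPE html><html><body><app><p>Hello</p></app></body></html>';

// Collect parser errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$document = new DOMDocument();
$loaded = $document->loadHTML($html);

// Read whatever the parser complained about, then reset the error state.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    echo trim($error->message), " (line ", $error->line, ")\n";
}

// The document is usable even if errors were reported.
echo $document->getElementsByTagName('p')->item(0)->nodeValue, "\n";
```

Unlike @, this keeps the messages available for logging while still keeping them out of the page output.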
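The query needs to change as well: the text sits directly inside the th, it is written "Território e Ambiente" in the source (not upper case), and the tr elements are siblings of the th's parent row, not of the th itself. Something like this works:
//th[contains(text(), "Território e Ambiente")]/parent::tr/following-sibling::tr[1]/td[3]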
That gets a th element containing the text Território e Ambiente, then gets the parent tr tag, then goes to the next tr sibling, and finally gets the third td element (where the value is). This is still quite brittle, so keep an eye on changes to the website, although its layout is unlikely to change.
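To see the steps of that query in isolation, here is a small sketch that runs it against an inline table instead of the live page. The table is a hypothetical imitation of the layout described above (the labels and the helper name clean() are made up), and non-ASCII characters are written as numeric entities so the sample does not depend on charset detection.

```php
<?php
// Hypothetical sample: a header row, then data rows separated by an empty row.
$html = <<<'HTML'
<table>
  <tr><th>Territ&#243;rio e Ambiente</th></tr>
  <tr><td></td><td>Label A</td><td>
      1.521,110   km&#178;
  </td></tr>
  <tr><td></td></tr>
  <tr><td></td><td>Label B</td><td> 92,6   % </td></tr>
</table>
HTML;

// Collapse runs of whitespace and strip the ends.
function clean(string $text): string
{
    return trim(preg_replace('/\s+/u', ' ', $text));
}

$document = new DOMDocument();
@$document->loadHTML($html);
$xpath = new DOMXPath($document);

// Step 1 and 2: find the th by its text, then climb to its row.
$base = '//th[contains(text(), "Território e Ambiente")]/parent::tr';

// Step 3 and 4: pick the nth following row, then its third cell.
$first  = $xpath->query($base . '/following-sibling::tr[1]/td[3]')->item(0);
$second = $xpath->query($base . '/following-sibling::tr[3]/td[3]')->item(0);

echo clean($first->nodeValue), "\n";
echo clean($second->nodeValue), "\n";
```

Using item(0) rather than [0] is a small precaution: it also works on older PHP versions where array access on a DOMNodeList is not available.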
So now you need to repeat that XPath query three more times, changing the index of the tr sibling (adding two each time, because there is an empty row between each pair of data rows). It ends up looking something like this:
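// Fetch the page (same URL as in the question)
$html = file_get_contents('https://cidades.ibge.gov.br/brasil/sp/sao-paulo/panorama');
$document = new DOMDocument();
// @ suppresses the warnings caused by non-standard tags in the page
@$document->loadHTML($html);
$domxpath = new DOMXPath($document);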
$paragraphs = $domxpath->query('//th[contains(text(), "Território e Ambiente")]/parent::tr/following-sibling::tr[1]/td[3]');
echo "First: ".preg_replace('/\s+/', ' ', $paragraphs[0]->nodeValue);
echo "<br>";
$paragraphs = $domxpath->query('//th[contains(text(), "Território e Ambiente")]/parent::tr/following-sibling::tr[3]/td[3]');
echo "Second: ".preg_replace('/\s+/', ' ', $paragraphs[0]->nodeValue);
echo "<br>";
$paragraphs = $domxpath->query('//th[contains(text(), "Território e Ambiente")]/parent::tr/following-sibling::tr[5]/td[3]');
echo "Third: ".preg_replace('/\s+/', ' ', $paragraphs[0]->nodeValue);
echo "<br>";
$paragraphs = $domxpath->query('//th[contains(text(), "Território e Ambiente")]/parent::tr/following-sibling::tr[7]/td[3]');
echo "Fourth: ".preg_replace('/\s+/', ' ', $paragraphs[0]->nodeValue);
First: 1.521,110 km²
Second: 92,6 %
Third: 74,8 %
Fourth: 50,3 %
Note the use of preg_replace() to get rid of the abundant whitespace.
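As a side note, XPath 1.0 has a normalize-space() function that does a similar cleanup, and DOMXPath::evaluate() can return its result as a string. A minimal sketch comparing the two approaches on a hypothetical one-cell table:

```php
<?php
// Hypothetical sample cell with the kind of padding found on the page.
$html = '<table><tr><td>
      92,6
         %
  </td></tr></table>';

$document = new DOMDocument();
@$document->loadHTML($html);
$xpath = new DOMXPath($document);

// Approach 1: read the raw value and clean it up in PHP.
$raw = $xpath->query('//td')->item(0)->nodeValue;
$viaRegex = trim(preg_replace('/\s+/', ' ', $raw));

// Approach 2: let XPath collapse and trim the whitespace.
$viaXPath = $xpath->evaluate('normalize-space(//td)');

var_dump($viaRegex === $viaXPath);
```

normalize-space() only works on one node at a time (the first in document order when given a node set), so it fits the single-value queries better than a query that returns several cells.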
And using a little more XPath magic we can get it to work with only one query:
//th[contains(text(), "Território e Ambiente")]/parent::tr/following-sibling::tr[position() mod 2 = 1]/td[3]
It works the same as the others, but instead of getting a specific tr sibling element, it gets every other one.
$paragraphs = $domxpath->query('//th[contains(text(), "Território e Ambiente")]/parent::tr/following-sibling::tr[position() mod 2 = 1]/td[3]');
foreach ($paragraphs as $i => $p) {
    echo ($i + 1)." value: ".preg_replace('/\s+/', ' ', $p->nodeValue);
    echo "<br>";
}
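One thing to watch for: position() mod 2 = 1 matches every odd row that follows the header, so if the table has more rows after the four of interest (an assumption, I have not checked the live page), they would be returned too. A second predicate can cap the result. The sketch below uses a hypothetical inline table with a fifth "extra row" to show the effect; the labels are made up.

```php
<?php
// Hypothetical sample: five data rows, each followed by an empty row.
// Non-ASCII characters are numeric entities to avoid charset detection issues.
$values = ['1.521,110 km&#178;', '92,6 %', '74,8 %', '50,3 %', 'extra row'];
$rows = '';
foreach ($values as $v) {
    $rows .= "<tr><td></td><td>label</td><td>\n   $v \n</td></tr><tr><td></td></tr>";
}
$html = '<table><tr><th>Territ&#243;rio e Ambiente</th></tr>' . $rows . '</table>';

$document = new DOMDocument();
@$document->loadHTML($html);
$xpath = new DOMXPath($document);

// First predicate: keep every other row. Second predicate: keep the first four of those.
$nodes = $xpath->query(
    '//th[contains(text(), "Território e Ambiente")]/parent::tr'
    . '/following-sibling::tr[position() mod 2 = 1][position() <= 4]/td[3]'
);

$result = [];
foreach ($nodes as $node) {
    $result[] = trim(preg_replace('/\s+/u', ' ', $node->nodeValue));
}

print_r($result);
```

Predicates are applied left to right, so the second position() counts within the rows that survived the first filter, not within all siblings.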