I am running PHP 7.3.5 and "fabpot/goutte": "^3.2".
I am trying to scrape the introduction and the date from a link, but I get nothing back.
Below is my minimal example:
<?php
require_once 'vendor/autoload.php';
use Goutte\Client;
$client = new Client();
$url = 'body > div.container > div > div > ul.list-group.mb-5 > a';
$intr = 'body > div:nth-child(3) > div:nth-child(2) > div > table:nth-child(10) > tbody > tr > td > div > div:nth-child(1) > div > div > div > div > table > tbody > tr > th > table:nth-child(4) > tbody > tr > td';
$dat = 'body > div:nth-child(3) > div:nth-child(2) > div > table:nth-child(10) > tbody > tr > td > div > div:nth-child(1) > div > div > div > div > table > tbody > tr > th > table:nth-child(1) > tbody > tr > td:nth-child(1)';
//arrays
$introArr = array();
$urlArr = array();
$crawler = $client->request('GET', 'https://www.morningbrew.com/daily/2019/11/07');
$intro = $crawler->filter($intr)->each(function($node) {
return $node;
});
$date = $crawler->filter($dat)->each(function($node) {
return $node->html();
});
array_push( $introArr, $intro, $date);
I would like to get back:
Any suggestions as to what I am doing wrong?
I appreciate your replies!
The selectors that you provide to the filter() method (for both $intro and $date) point to nothing in the document's DOM tree.
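One way to check this for yourself is to count the nodes a selector matches before reading from them. The following is only a minimal sketch: it assumes symfony/dom-crawler and symfony/css-selector are installed (Goutte pulls both in), and the HTML string is a made-up fragment, not the real page.

```php
<?php
require_once 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup for illustration only, not the real Morning Brew page.
$html = '<html><body><div class="pcard"><table><tr><td>November 7, 2019</td></tr></table></div></body></html>';
$crawler = new Crawler($html);

// This positional selector expects a third child of <body>, which does not exist here.
$missing = $crawler->filter('body > div:nth-child(3)')->count();

// This selector matches the structure that is actually present.
$found = $crawler->filter('.pcard td')->count();

echo "missing: $missing, found: $found\n";
```

A count of 0 means the selector matched nothing, so each() has no nodes to iterate over and returns an empty array.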
First, a note about the chained selectors you came up with:
$intr = 'body > div:nth-child(3) > ...';
In case you didn't know, it's not necessary to start from the root node (the body tag) to find an element.
For example, if I wanted to retrieve the .myDiv element(s), I could just do the following:
$crawler->filter('.myDiv');
DOM parsers exist to spare you the pain of traversing every node to find one or more elements, wherever they are in the tree.
For simplicity, rely as little as possible on HTML tags to find a node, and use CSS class selectors whenever you can.
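To illustrate why class selectors tend to be more robust, here is a small sketch that runs a positional selector and a class selector against two layouts of the same content. The markup, the .intro class, and the variable names are all invented for the example; it assumes symfony/dom-crawler and symfony/css-selector are available through Goutte's dependencies.

```php
<?php
require_once 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Two hypothetical layouts of the same content: the second adds a wrapper div.
$layouts = [
    'flat'   => '<html><body><div><p class="intro">Good morning.</p></div></body></html>',
    'nested' => '<html><body><div><div><p class="intro">Good morning.</p></div></div></body></html>',
];

$positional = 'body > div > p'; // depends on the exact nesting
$byClass    = '.intro';         // depends only on the class name

$counts = [];
foreach ($layouts as $name => $html) {
    $crawler = new Crawler($html);
    $counts[$name] = [
        'positional' => $crawler->filter($positional)->count(),
        'class'      => $crawler->filter($byClass)->count(),
    ];
    printf("%s: positional=%d, class=%d\n", $name, $counts[$name]['positional'], $counts[$name]['class']);
}
```

In this sketch the positional selector stops matching as soon as one wrapper element is added, while the class selector still finds the paragraph.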
Working example:
$subCrawler = $client->request('GET', 'https://www.morningbrew.com/daily/2019/11/07');
$date = $subCrawler->filter('.pcard')
->filter('table:first-child')
->filter('td:first-child')
->text();
$text = $subCrawler->filter('.pcard')
->filter('table:nth-child(4)')
->text();
Notes:
As we only expect one node, there's no need to iterate with each() to retrieve the node's content.
filter() calls are chained here for readability, but it's a matter of preference. Concatenating all selectors into one is also valid.
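As a rough sketch of the second note, the snippet below compares the chained form with a single concatenated selector on a made-up fragment (not the real page), and guards a lookup with count(), since text() throws an InvalidArgumentException when the node list is empty. The .does-not-exist class is deliberately fictional. For this markup both forms select the same cell; that may not hold for every document, because each chained filter() can also match the current node itself.

```php
<?php
require_once 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup loosely modelled on the answer's selectors.
$html = '<html><body><div class="pcard">'
      . '<table><tr><td>November 7, 2019</td><td>Share</td></tr></table>'
      . '<table><tr><td>Intro text</td></tr></table>'
      . '</div></body></html>';
$crawler = new Crawler($html);

// Chained filter() calls, as in the working example.
$chained = $crawler->filter('.pcard')
    ->filter('table:first-child')
    ->filter('td:first-child')
    ->text();

// The same lookup written as one concatenated selector.
$single = $crawler->filter('.pcard table:first-child td:first-child')->text();

// Guard against an empty node list before calling text().
$node  = $crawler->filter('.does-not-exist');
$value = $node->count() > 0 ? $node->text() : null;

var_dump($chained === $single, $value);
```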