php css-selectors goutte

Empty return from link using Goutte


I am running PHP 7.3.5 and "fabpot/goutte": "^3.2".

I am trying to scrape the introduction and the date from a link; however, I get nothing back.

Find below my minimal example:

<?php
require_once 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();

$url = 'body > div.container > div > div > ul.list-group.mb-5 > a';
$intr = 'body > div:nth-child(3) > div:nth-child(2) > div > table:nth-child(10) > tbody > tr > td > div > div:nth-child(1) > div > div > div > div > table > tbody > tr > th > table:nth-child(4) > tbody > tr > td';
$dat = 'body > div:nth-child(3) > div:nth-child(2) > div > table:nth-child(10) > tbody > tr > td > div > div:nth-child(1) > div > div > div > div > table > tbody > tr > th > table:nth-child(1) > tbody > tr > td:nth-child(1)';

//arrays
$introArr = array();
$urlArr = array();

$crawler = $client->request('GET', 'https://www.morningbrew.com/daily/2019/11/07');
$intro = $crawler->filter($intr)->each(function($node) {
    return $node;
});
$date = $crawler->filter($dat)->each(function($node) {
    return $node->html();
});
array_push( $introArr, $intro, $date);

I would like to get back the introduction and the date from that page (the original post showed them in a screenshot).

Any suggestions as to what I am doing wrong?

I appreciate your replies!


Solution

  • The selectors that you provide to the filter() method (for both $intro and $date) point to nothing in the document's DOM tree.
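
    One way to confirm that a selector matches nothing is to check count() on the filtered crawler. The sketch below is only an illustration: it assumes the Symfony DomCrawler and CssSelector components (which Goutte installs as dependencies) and uses made-up inline markup instead of the real page.

```php
<?php
require_once 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup standing in for the real page.
$html = '<html><body><div class="pcard"><table><tr><td>November 7, 2019</td></tr></table></div></body></html>';
$crawler = new Crawler($html);

// <body> has a single child <div>, so div:nth-child(3) cannot match anything.
$missing = $crawler->filter('body > div:nth-child(3) > table');
// A short class-based selector does match.
$found = $crawler->filter('.pcard td');

$missingCount = $missing->count();
$foundCount   = $found->count();

// each() on an empty node list never runs the callback and returns an empty array,
// which is why the code in the question ends up with nothing.
$missingTexts = $missing->each(function (Crawler $node) {
    return $node->text();
});

var_dump($missingCount, $foundCount, $missingTexts);
```

    Running the same count() check on $intr and $dat against the real page should show whether they match anything.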


    First of all, a quick note on the chained selectors you came up with:

    $intr = 'body > div:nth-child(3) > ...';
    

    In case you didn't know, it's not necessary to start from the root node (the body tag) to find an element. For example, if I wanted to retrieve the .myDiv element(s), I could just do the following:

    $crawler->filter('.myDiv');
    

    DOM parsers exist precisely to spare you the pain of traversing every node to find one or more elements, wherever they are in the tree.
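
    As a small illustration of that point, the sketch below finds a deeply nested element by class alone. The markup is hypothetical, and the code assumes the Symfony DomCrawler and CssSelector components that ship with Goutte.

```php
<?php
require_once 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical document: the .myDiv element sits several levels below <body>.
$html = '<html><body><div><div><div><p class="myDiv">Hello</p></div></div></div></body></html>';
$crawler = new Crawler($html);

// No path from the root is needed; the class selector is matched anywhere in the tree.
$text = $crawler->filter('.myDiv')->text();

echo $text, PHP_EOL;
```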


    To keep things simple, rely as little as possible on HTML tag names to find a node, and use CSS class selectors whenever you can.

    Working example:

    $subCrawler = $client->request('GET', 'https://www.morningbrew.com/daily/2019/11/07');
    
    $date = $subCrawler->filter('.pcard')
                       ->filter('table:first-child')
                       ->filter('td:first-child')
                       ->text();
    
    $text = $subCrawler->filter('.pcard')
                       ->filter('table:nth-child(4)')
                       ->text();
    

    Notes:

    • As we only expect one node, there's no need to iterate with each() to retrieve the node's content.

    • The filter() calls are chained here for readability, but that is a matter of preference. Concatenating all the selectors into one is also valid.
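
    Both notes can be sketched on hypothetical markup loosely modelled on the working example (the .pcard structure and the .missing class are assumptions, not the real page). As far as I know, text() throws an InvalidArgumentException when the node list is empty, so a count() guard is a reasonable precaution when a node might be absent.

```php
<?php
require_once 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup: two tables inside a .pcard container.
$html = '<html><body><div class="pcard">'
      . '<table><tr><td>November 7, 2019</td><td>Other</td></tr></table>'
      . '<table><tr><td>Intro text</td></tr></table>'
      . '</div></body></html>';
$crawler = new Crawler($html);

// Chained filter() calls, as in the working example.
$chained = $crawler->filter('.pcard')
                   ->filter('table:first-child')
                   ->filter('td:first-child')
                   ->text();

// The same lookup written as one concatenated selector.
$combined = $crawler->filter('.pcard table:first-child td:first-child')->text();

// Guard against an empty node list before calling text().
$node     = $crawler->filter('.pcard .missing');
$fallback = $node->count() > 0 ? $node->text() : null;

var_dump($chained, $combined, $fallback);
```

    On this markup, both forms return the same cell text, and the guarded lookup yields null instead of raising an exception.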