Tags: symfony, web-scraping, web-crawler, guzzle, goutte

Goutte Scraper Parse through Page Object


It's been kind of a learning experience for me, but using Symfony and Goutte I've been able to log in to a secure website and then return a page.

echo $crawler->html(); 

What I want to do now is parse through the object $crawler. What confuses me is Goutte doesn't seem to show much about how to do this. I think a lot of people have used Guzzle along with Goutte, but I can't do a use Guzzle\Client; statement along with use Goutte\Client;.

All I want to do is parse through the $crawler object to find certain things in the html source code. (Note: this specific page does not use id's or classes, so I can't do filter('#stuff') or filter('.stuff').)

Can someone help explain to me how to use Goutte to parse through the object I've gotten?

(edit: I wanted to specify, I'm trying to perhaps just search for a string or something. Can I convert the $crawler object to plain text source code then just do a preg_match or something?)


Solution

  • The $crawler is an instance of the Crawler class from the Symfony DomCrawler component, which is essentially a set of DOMElement objects.

    The crawler provides quite a bit of functionality for filtering individual nodes, either by using XPath queries:

    $crawler = $crawler->filterXPath('descendant-or-self::body/p');
    

    or by using CSS selectors:

    $crawler = $crawler->filter('body > p');
    

    With either approach, it is possible to filter your document by HTML elements (tag names and document structure) rather than by attributes such as id or class. More information on CSS selectors is easy to find (the reference I used was the first link from a Google search).
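
    As a rough illustration of filtering by structure and by text content when a page has no ids or classes, here is a minimal sketch. It assumes symfony/dom-crawler and symfony/css-selector are installed through Composer with the autoloader at vendor/autoload.php; the markup and the search string "Balance" are made up for the example and stand in for the page Goutte returned.

    ```php
    <?php
    // Assumption: Composer autoloader with symfony/dom-crawler and symfony/css-selector.
    require __DIR__ . '/vendor/autoload.php';

    use Symfony\Component\DomCrawler\Crawler;

    // Hypothetical markup standing in for the page returned by Goutte.
    $html = '<html><body>'
          . '<p>Welcome back</p>'
          . '<p>Balance: 42.50 USD</p>'
          . '<div><p>Nested paragraph</p></div>'
          . '</body></html>';

    $crawler = new Crawler($html);

    // CSS selector: only paragraphs that are direct children of <body>.
    $direct = $crawler->filter('body > p');

    // XPath: any paragraph whose text contains a given string.
    $matches = $crawler->filterXPath('//p[contains(., "Balance")]');

    echo count($direct), "\n";    // 2
    echo $matches->text(), "\n";  // Balance: 42.50 USD
    ```

    With Goutte, the $crawler returned by the client's request can be used in place of the one built from a string here.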

    The Crawler::html() method was added in Symfony 2.3; it will "return the first node of the list as HTML", which gives you the inner HTML of that node:

    $html = $crawler->html();
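
    To connect this to the question's edit about searching the source as plain text, the string returned by html() can be passed to preg_match. This is only a sketch under assumptions: symfony/dom-crawler is installed through Composer, and the markup and the "Order number" pattern are invented for the example.

    ```php
    <?php
    // Assumption: Composer autoloader with symfony/dom-crawler available.
    require __DIR__ . '/vendor/autoload.php';

    use Symfony\Component\DomCrawler\Crawler;

    // Hypothetical markup standing in for the page returned by Goutte.
    $crawler = new Crawler('<html><body><p>Order number: 12345</p></body></html>');

    // html() returns the inner HTML of the first node in the crawler as a string.
    $html = $crawler->filterXPath('//body')->html();

    // Ordinary string and regex functions work on that string.
    $orderNumber = null;
    if (preg_match('/Order number:\s*(\d+)/', $html, $m)) {
        $orderNumber = $m[1];
    }

    echo $orderNumber, "\n"; // 12345
    ```

    Regular expressions over HTML tend to break when the markup changes, so filtering with XPath or CSS first and matching only on the text of the narrowed-down node is usually the sturdier choice.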
    

    It should be noted that when you perform a filter, a new crawler object is returned with a list of DOMElements which can lead to some unexpected results (at least that's what I have experienced).

    Edit: In response to your comment, it is entirely possible to filter based off of the new criteria (reference the comment below).

    You use a CSS Selector like:

    [attribute=value]

    So your code would look like:

    $crawler = $crawler->filter('a[href="' . $value . '"]');
    

    Accessing node values can be as simple as using the functions supplied by the DomCrawler component, or you can work directly with the underlying DOMNode / DOMNodeList / DOMElement objects.
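
    Both routes might look something like the sketch below. It assumes symfony/dom-crawler is installed through Composer; the two links are hypothetical markup. The first half uses the crawler's own text(), attr() and each() methods, and the second half iterates over the crawler to reach the plain DOMElement objects.

    ```php
    <?php
    // Assumption: Composer autoloader with symfony/dom-crawler available.
    require __DIR__ . '/vendor/autoload.php';

    use Symfony\Component\DomCrawler\Crawler;

    // Hypothetical markup standing in for the page returned by Goutte.
    $crawler = new Crawler(
        '<html><body><a href="/one">First</a><a href="/two">Second</a></body></html>'
    );

    $links = $crawler->filterXPath('//a');

    // Crawler helpers: text() and attr() read from the first node in the list.
    $firstText = $links->first()->text();
    $firstHref = $links->first()->attr('href');

    // each() runs a closure on every node, wrapping each one in its own Crawler.
    $pairs = $links->each(function (Crawler $node) {
        return [$node->text(), $node->attr('href')];
    });

    // Iterating the crawler itself yields the underlying DOMElement objects.
    $hrefs = [];
    foreach ($links as $element) {
        $hrefs[] = $element->getAttribute('href');
    }

    print_r($pairs);
    print_r($hrefs);
    ```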

    Behind the scenes, the DomCrawler component makes use of the CssSelector component, which converts CSS selectors into XPath expressions.