Goutte extract text with tags


I'm learning to use Goutte to scrape descriptions from websites. It retrieves the text but strips all tags (e.g. <br>, <b>). Is there a way to retrieve the full contents of a div, including its HTML tags? Or is there an easier alternative that can do this?

    <?php
    require_once "vendor/autoload.php";

    use Goutte\Client;

    // Initialize a new client
    $client = new Client();
    $crawler = $client->request('GET', "examplesite.com/example");

    // Crawl the response; '_text' returns only the text content, so tags are lost
    $description = $crawler->filter('element.class')->extract('_text');
    ?>

Solution

  • You can use the html() function

    http://api.symfony.com/4.0/Symfony/Component/DomCrawler/Crawler.html#method_html

    Like this:

    $descriptions = $crawler->filter('element.class')->each(function($node) {
        // html() returns the node's inner HTML, tags included
        return $node->html();
    });
    

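    Afterwards, you can use PHP's strip_tags function to clean it up. Its optional second argument lists the tags to keep, so you can drop unwanted markup while preserving tags such as <br> and <b>.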

    http://php.net/manual/fr/function.strip-tags.php
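
To illustrate the clean-up step, here is a minimal sketch of strip_tags with an allow-list. The $html string is a made-up fragment standing in for whatever $node->html() returns on your page, and the choice of <br> and <b> as tags to keep is only an assumption based on the question.

```php
<?php
// Hypothetical fragment, standing in for the value returned by $node->html().
$html = '<p>First line<br>Second <b>bold</b> <a href="#">link</a></p>';

// Keep only <br> and <b>. Every other tag is removed, but its inner text stays.
$clean = strip_tags($html, '<br><b>');

echo $clean, PHP_EOL;
// prints: First line<br>Second <b>bold</b> link
```

If you want to keep every tag, skip strip_tags entirely and use the html() result as is.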