Using Goutte to read from a file / string


I'm using Goutte to make a webscraper.

For development, I've saved an .html document that I'd like to traverse (so I'm not constantly making requests to the website). Here's what I have so far:

use Goutte\Client;

$client = new Client();
$html=file_get_contents('test.html');
$crawler = $client->request(null,null,[],[],[],$html);

As far as I understand, this should call request() in Symfony\Component\BrowserKit and pass in the raw body data. Here's the error message I'm getting:

PHP Fatal error:  Uncaught exception 'GuzzleHttp\Exception\ConnectException' with message 'cURL error 7: Failed to connect to localhost port 80: Connection refused (see http://curl.haxx.se/libcurl/c/libcurl-errors.html)' in C:\Users\Ally\Sites\scrape\vendor\guzzlehttp\guzzle\src\Handler\CurlFactory.

If I were to just use DomCrawler, it would be trivial to create a crawler from a string (see: http://symfony.com/doc/current/components/dom_crawler.html). I'm just unsure how to do the equivalent with Goutte.

Thanks in advance.


Solution

  • The tools you chose make real HTTP connections, so they are not suitable for what you want to do, at least not out of the box.

    Option 1: Implement your own BrowserKit Client

    All Goutte does is extend BrowserKit's Client and implement the HTTP requests with Guzzle.

    To implement your own client, all you need to do is extend Symfony\Component\BrowserKit\Client and provide the doRequest() method:

    use Symfony\Component\BrowserKit\Client;
    use Symfony\Component\BrowserKit\Request;
    use Symfony\Component\BrowserKit\Response;
    
    class FilesystemClient extends Client
    {
        /**
         * @param Request $request The BrowserKit request to serve from disk
         *
         * @return Response The saved page, or a 404 response if no file matches
         */
        protected function doRequest($request)
        {
            $file = $this->getFilePath($request->getUri());
    
            if (!file_exists($file)) {
                return new Response('Page not found', 404, []);
            }
    
            $content = file_get_contents($file);
    
            return new Response($content, 200, []);
        }
    
        private function getFilePath($uri)
        {
            // Convert a URI to the path of your saved response.
            // It could be something like this:
            return preg_replace('#[^a-zA-Z_\-\.]#', '_', $uri).'.html';
        }
    }
    
    $client = new FilesystemClient();
    $crawler = $client->request('GET', '/test');
    

    The client's request() method expects real URIs, so you need to implement your own logic to convert each URI to a filesystem location.
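
    As an illustration of such a mapping, here is a minimal sketch that uses only the path component of the URI. The function name uriToFilePath() and the $baseDir parameter are made up for this example. Note that, as far as I can tell, BrowserKit turns a relative URI like /test into an absolute one (http://localhost/test by default) before doRequest() sees it, so mapping the whole URI would also put the scheme and host into the file name.

```php
<?php
// Hypothetical helper (not part of BrowserKit or Goutte):
// maps a request URI to the saved file that should answer it.
function uriToFilePath(string $uri, string $baseDir = '.'): string
{
    // parse_url() gives null when the URI has no path and false when it is malformed.
    $path = parse_url($uri, PHP_URL_PATH);

    // Drop leading and trailing slashes; fall back to "index" for the site root.
    $name = trim((string) $path, '/');
    if ($name === '') {
        $name = 'index';
    }

    // Keep letters, digits, underscores, hyphens and dots; replace everything else,
    // including slashes, so the result is always a single file name.
    $name = preg_replace('#[^a-zA-Z0-9_\-.]#', '_', $name);

    return $baseDir . '/' . $name . '.html';
}

echo uriToFilePath('http://localhost/test'), PHP_EOL;        // ./test.html
echo uriToFilePath('http://localhost/blog/post-2'), PHP_EOL; // ./blog_post-2.html
echo uriToFilePath('http://localhost'), PHP_EOL;             // ./index.html
```

    Unlike the pattern in the class above, this one keeps digits, so /page1 and /page2 end up in different files.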

    Have a look at Goutte's Client for inspiration.

    Option 2: Implement a custom Guzzle handler

    Since Goutte uses Guzzle, you could provide your own Guzzle handler that loads responses from files instead of making real HTTP requests. Have a look at the handlers and middleware documentation.

    If you're just after caching responses so that you make fewer HTTP requests, Guzzle already provides support for this.
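
    A rough sketch of the handler approach, assuming Goutte 3.x on top of Guzzle 6 (which is what the error message in the question suggests) and a test.html saved next to the script. The $pagesDir variable and the URI-to-file rule are only examples. Later Goutte versions are not built on Guzzle, so this would not apply there.

```php
<?php
// Assumes Goutte 3.x and Guzzle 6 installed through Composer.
require __DIR__ . '/vendor/autoload.php';

use Goutte\Client;
use GuzzleHttp\Client as GuzzleClient;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Promise\FulfilledPromise;
use GuzzleHttp\Psr7\Response;
use Psr\Http\Message\RequestInterface;

// Example location of the saved pages.
$pagesDir = __DIR__;

// A Guzzle handler is a callable that takes a request and returns a promise of a response.
$fileHandler = function (RequestInterface $request, array $options) use ($pagesDir) {
    // Map e.g. http://localhost/test to $pagesDir/test.html (illustrative rule only).
    $name = basename($request->getUri()->getPath());
    $file = $pagesDir . '/' . $name . '.html';

    $response = is_file($file)
        ? new Response(200, ['Content-Type' => 'text/html'], file_get_contents($file))
        : new Response(404, [], 'Page not found');

    return new FulfilledPromise($response);
};

// Plug the handler into Guzzle, then hand that Guzzle client to Goutte.
$guzzle = new GuzzleClient(['handler' => HandlerStack::create($fileHandler)]);

$client = new Client();
$client->setClient($guzzle);

// Answered from disk by $fileHandler; no connection is opened.
$crawler = $client->request('GET', 'http://localhost/test');

$titles = $crawler->filterXPath('//title');
echo $titles->count() > 0 ? $titles->text() : 'no <title> found', PHP_EOL;
```

    The upside of this option is that the rest of your scraper keeps using Goutte unchanged, so switching back to live requests only means dropping the setClient() call.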

    Option 3: Use DomCrawler directly

    use Symfony\Component\DomCrawler\Crawler;
    $crawler = new Crawler(file_get_contents('test.html'));
    

    The only drawback is that you'll lose some of the convenience methods of the BrowserKit client, such as click() or submit().
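
    For completeness, a small sketch of what traversing the saved page could look like with DomCrawler alone. The HTML is inlined here instead of read from test.html so the example is self-contained, and the URI passed as the second constructor argument is an example value. As far as I know it is needed if you want link() to resolve relative links.

```php
<?php
// Assumes symfony/dom-crawler is installed through Composer.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Inline stand-in for file_get_contents('test.html').
$html = <<<'HTML'
<html>
    <body>
        <h1>Hello</h1>
        <a href="/page/2">Next</a>
    </body>
</html>
HTML;

// The second argument is the URI the page was originally fetched from (example value).
$crawler = new Crawler($html, 'http://example.com/articles/index.html');

// filterXPath() works without extra packages; filter() with CSS selectors
// additionally needs symfony/css-selector.
echo $crawler->filterXPath('//h1')->text(), PHP_EOL;          // Hello
echo $crawler->selectLink('Next')->link()->getUri(), PHP_EOL; // http://example.com/page/2
```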