
php-html-parser How to follow redirects


https://github.com/paquettg/php-html-parser Does anybody know how to follow redirects with this library? For example:

require "vendor/autoload.php";
use PHPHtmlParser\Dom;
$dom = new Dom;
$dom->loadFromUrl($html);

Solution

  • Versions:

    • guzzlehttp/guzzle: "7.2.0"
    • paquettg/php-html-parser: "3.1.1"

    Why does the library not natively allow redirects?

    The loadFromUrl method has the following signature (as of version 3.1.1):

        public function loadFromUrl(string $url, ?Options $options = null, ?ClientInterface $client = null, ?RequestInterface $request = null): Dom
        {
            if ($client === null) {
                $client = new Client();
            }
            if ($request === null) {
                $request = new Request('GET', $url);
            }
    
            $response = $client->sendRequest($request);
            $content = $response->getBody()->getContents();
    
            return $this->loadStr($content, $options);
        }
    

    The line $response = $client->sendRequest($request); calls sendRequest on Guzzle's Client - https://github.com/guzzle/guzzle/blob/master/src/Client.php#L131

    /**
    * The HttpClient PSR (PSR-18) specify this method.
    *
    * @inheritDoc
    */
    public function sendRequest(RequestInterface $request): ResponseInterface
    {
       $options[RequestOptions::SYNCHRONOUS] = true;
       $options[RequestOptions::ALLOW_REDIRECTS] = false;
       $options[RequestOptions::HTTP_ERRORS] = false;
    
       return $this->sendAsync($request, $options)->wait();
    }
    

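    The line $options[RequestOptions::ALLOW_REDIRECTS] = false; turns redirects off for every request sent through sendRequest. These per-request options override whatever you configured on the Client, and the Request object carries no redirect setting of its own, so a stock Guzzle Client passed to loadFromUrl still will not follow redirects.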
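
To see the difference without touching the network, here is a minimal sketch that uses Guzzle's MockHandler to queue a 301 followed by a 200, then sends the same request through sendRequest and through send. The example.test URLs and the makeMockClient helper are made up for illustration; the sketch assumes guzzlehttp/guzzle 7.x is installed via Composer.

```php
<?php
// Assumes guzzlehttp/guzzle 7.x is installed via Composer.
require "vendor/autoload.php";

use GuzzleHttp\Client;
use GuzzleHttp\Handler\MockHandler;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Psr7\Request;
use GuzzleHttp\Psr7\Response;

// Hypothetical helper: builds a client whose "network" is a queue of canned
// responses, a 301 redirect followed by the final 200 page.
function makeMockClient(): Client
{
    $mock = new MockHandler([
        new Response(301, ['Location' => 'https://example.test/']),
        new Response(200, [], '<a href="/">home</a>'),
    ]);

    // HandlerStack::create() adds Guzzle's default middleware, including redirects.
    return new Client([
        'handler' => HandlerStack::create($mock),
        'allow_redirects' => true, // explicitly on, yet sendRequest() overrides it
    ]);
}

$request = new Request('GET', 'http://example.test/');

// PSR-18 entry point: redirects are forced off, so the 301 comes back as-is.
$psr18Response = makeMockClient()->sendRequest($request);

// Guzzle's own entry point: the client's allow_redirects setting is honored.
$guzzleResponse = makeMockClient()->send($request);

echo $psr18Response->getStatusCode(), PHP_EOL;  // 301
echo $guzzleResponse->getStatusCode(), PHP_EOL; // 200
```

The same client configuration gives two different results depending only on which method sends the request, which is why configuring the Client alone does not help with loadFromUrl.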

    How to follow redirects with the library

    loadFromUrl only makes the request, reads the response body, and hands it to loadStr. We can do the same steps ourselves, making the request with Guzzle directly (it is already a dependency of the library).

    <?php
    use GuzzleHttp\Client;
    use GuzzleHttp\Exception\GuzzleException;
    use PHPHtmlParser\Dom;
    
    // Include the Composer autoloader
    include_once("vendor/autoload.php");
    
    $client = new Client();
    try {
        // allow_redirects is shown for verbosity's sake; it is on by default with Guzzle clients.
        // request() returns the final response after any redirects have been followed.
        $response = $client->request('GET', 'http://theeasyapi.com', ['allow_redirects' => true]);
    
        // This would work exactly the same
        //$response = $client->request('GET', 'http://theeasyapi.com');
    } catch(GuzzleException $e) {
        // Probably do something with $e
        var_dump($e->getMessage());
        exit;
    }
    
    // Parse the body of the final response
    $dom = new Dom();
    $domExample = $dom->loadStr($response->getBody()->getContents());
    foreach($domExample->find('a') as $link) {
        var_dump($link->text);
    }
    

    The code above instantiates a new Guzzle Client and requests the URL with redirects allowed. The site used in this example 301-redirects from the non-secure (http) address to the secure (https) one.
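
If you would rather keep calling loadFromUrl, another option is to pass it your own client through its third parameter. The sketch below is an assumption-laden illustration, not part of either library: RedirectFollowingClient is a hypothetical wrapper, and it assumes the ClientInterface in the loadFromUrl signature is the PSR-18 Psr\Http\Client\ClientInterface. The wrapper delegates to Guzzle's send(), which honors allow_redirects. A MockHandler with made-up example.test responses keeps it runnable offline.

```php
<?php
// Assumes paquettg/php-html-parser 3.1.x and guzzlehttp/guzzle 7.x via Composer.
require "vendor/autoload.php";

use GuzzleHttp\Client;
use GuzzleHttp\Handler\MockHandler;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Psr7\Response;
use PHPHtmlParser\Dom;
use Psr\Http\Client\ClientInterface;
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\ResponseInterface;

// Hypothetical helper (not part of either library): a PSR-18 client that
// delegates to Guzzle's send(), which honors allow_redirects.
final class RedirectFollowingClient implements ClientInterface
{
    /** @var Client */
    private $guzzle;

    public function __construct(Client $guzzle)
    {
        $this->guzzle = $guzzle;
    }

    public function sendRequest(RequestInterface $request): ResponseInterface
    {
        return $this->guzzle->send($request, [
            'allow_redirects' => true,
            'http_errors' => false, // return 4xx/5xx responses instead of throwing
        ]);
    }
}

// Canned responses so the sketch runs offline; in real use pass `new Client()`.
$mock = new MockHandler([
    new Response(301, ['Location' => 'https://example.test/']),
    new Response(200, [], '<html><body><a href="/docs">Docs</a></body></html>'),
]);
$guzzle = new Client(['handler' => HandlerStack::create($mock)]);

$dom = new Dom();
$dom->loadFromUrl('http://example.test/', null, new RedirectFollowingClient($guzzle));

// Print the href and text of every link found on the final page.
foreach ($dom->find('a') as $link) {
    echo $link->getAttribute('href'), ' ', $link->text, PHP_EOL;
}
```

Compared with calling loadStr yourself, this keeps the request logic in one reusable class, at the cost of an extra type to maintain.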