
Goutte Client: how to store and retrieve $crawler?


My code looks like this:

    <?php
    require_once 'vendor/autoload.php';
    use Goutte\Client;
    use Symfony\Component\HttpClient\HttpClient;
    //generate random string
    function generateRandomString($length = 10)
    {
        $characters = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
        $charactersLength = strlen($characters);
        $randomString = '';
        for ($i = 0; $i < $length; $i++) {
            $randomString .= $characters[rand(0, $charactersLength - 1)];
        }
        return $randomString;
    }
    //creating Goutte Client
    $client = new Client(HttpClient::create(array(
        'headers' => array(
            'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language' => 'en-US,en;q=0.5',
            'Connection' => 'keep-alive',
        ),
    )));
    //Request
    $crawler = $client->request('GET', 'example.com/login');
    $session_id = generateRandomString(15);
    
    // Write the serialized PHP object to a text file
    $objData = serialize($crawler);
    $filePath = getcwd() . DIRECTORY_SEPARATOR . "sessions" . DIRECTORY_SEPARATOR . "obj" . $session_id . ".txt";
    $fp = fopen($filePath, "w");
    fwrite($fp, $objData);
    fclose($fp);
    
    // Read the text file back to restore the object
    $crawler_new = file_get_contents($filePath);
    $obj = unserialize($crawler_new);
    
    print_r($obj);
    die();

The code above produces the following output:

    Warning:  print_r(): Invalid State Error in C:\xampp\htdocs\verisys\index.php on line 80
    
    Warning:  print_r(): Invalid State Error in C:\xampp\htdocs\verisys\index.php on line 80
    
    Warning:  print_r(): Invalid State Error in C:\xampp\htdocs\verisys\index.php on line 80
           
    Warning:  print_r(): Invalid State Error in C:\xampp\htdocs\verisys\index.php on line 80
                    
    Warning:  print_r(): Invalid State Error in C:\xampp\htdocs\verisys\index.php on line 80
    .
    .
    . 
    Warning:  print_r(): Invalid State Error in C:\xampp\htdocs\verisys\index.php on line 80
    
    Symfony\Component\DomCrawler\Crawler Object
    (
    [uri:protected] => example.com/login/
    [defaultNamespacePrefix:Symfony\Component\DomCrawler\Crawler:private] => default
    [namespaces:Symfony\Component\DomCrawler\Crawler:private] => Array
    (
    )
    
    [baseHref:Symfony\Component\DomCrawler\Crawler:private] => example.com/login/
    [document:Symfony\Component\DomCrawler\Crawler:private] => DOMDocument Object
    (
    [implementation] => (object value omitted)
    [strictErrorChecking] =>
    [config] =>
    [formatOutput] =>
    [validateOnParse] =>
    [resolveExternals] =>
    [preserveWhiteSpace] =>
    [recover] =>
    [substituteEntities] =>
    )
    
    [nodes:Symfony\Component\DomCrawler\Crawler:private] => Array
    (
    [0] => DOMElement Object
    (
    [schemaTypeInfo] =>
    )
    
    )
    
    [isHtml:Symfony\Component\DomCrawler\Crawler:private] => 1
    [html5Parser:Symfony\Component\DomCrawler\Crawler:private] =>
    )

Can anyone help me store the $crawler object in a file? Basically, I need a human on the client side to solve the reCAPTCHA. I am working on a project where the whole process runs on the server through Goutte, but the login page is protected by reCAPTCHA. I want the client to solve it, and then the server should continue with the rest of the process.


Solution

  • Just create the same client again, reusing the session cookie from the first request:

    use Goutte\Client;
    use Symfony\Component\BrowserKit\Cookie;
    use Symfony\Component\HttpClient\HttpClient;

    // Session ID the site issued on the first request
    $sessionId = "0000H_WHw_eFPKVUDGxUei7v3PH:1db7cfi4s";
    $userAgent = 'Mozilla/5.0 (Windows NT x.y; Win64; x64; rv:10.0) Gecko/20100101 Firefox/10.0';
    $client = new Client(HttpClient::create(array(
        'headers' => array(
            'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language' => 'en-US,en;q=0.5',
            'Connection' => 'keep-alive',
            // Should match the host of the request URL
            'Host' => 'verification.nadra.gov.pk',
            'Cookie' => 'JSESSIONID=' . $sessionId,
            'User-Agent' => $userAgent,
        ),
    )));
    // Register the cookie in the cookie jar too: value only, domain without scheme
    $cookie = new Cookie("JSESSIONID", $sessionId, null, "/service", "example.com", true, true);
    $client->getCookieJar()->set($cookie);
    $client->setServerParameter('HTTP_USER_AGENT', $userAgent);
    $client->followRedirects(true);
    $crawler = $client->request('GET', 'https://example.com/service/botdetectcaptcha?get=image&c=exampleCaptcha&t=508c5eaf74fd4858b0c9debafc319d67');
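
The DOM objects inside a Crawler do not survive serialize()/unserialize(), which is what the empty DOMDocument and the "Invalid State Error" warnings in the question show. One possible alternative is to save only plain data between the two requests: the session ID, the page URI, and the page HTML. The sketch below is a minimal illustration using plain PHP only. The helper names `saveScrapeState` and `loadScrapeState`, the `state_<token>.json` file layout, and the array keys are all hypothetical and not part of Goutte or Symfony.

```php
<?php
// Hypothetical helpers (not part of Goutte): persist plain data needed to
// resume a scraping session, instead of serializing the Crawler object.

function scrapeStatePath(string $dir, string $token): string
{
    // Only allow alphanumeric tokens so the token cannot escape $dir.
    if (!ctype_alnum($token)) {
        throw new InvalidArgumentException('Token must be alphanumeric.');
    }
    return $dir . DIRECTORY_SEPARATOR . 'state_' . $token . '.json';
}

function saveScrapeState(string $dir, string $token, array $state): string
{
    if (!is_dir($dir) && !mkdir($dir, 0700, true) && !is_dir($dir)) {
        throw new RuntimeException("Cannot create directory: $dir");
    }
    $path = scrapeStatePath($dir, $token);
    // json_encode expects valid UTF-8; other encodings would need converting first.
    $json = json_encode($state, JSON_THROW_ON_ERROR);
    if (file_put_contents($path, $json, LOCK_EX) === false) {
        throw new RuntimeException("Cannot write state file: $path");
    }
    return $path;
}

function loadScrapeState(string $dir, string $token): array
{
    $path = scrapeStatePath($dir, $token);
    if (!is_file($path)) {
        throw new RuntimeException("No saved state for token: $token");
    }
    $json = file_get_contents($path);
    if ($json === false) {
        throw new RuntimeException("Cannot read state file: $path");
    }
    return json_decode($json, true, 512, JSON_THROW_ON_ERROR);
}
```

With Goutte, the values to save would presumably come from something like `$client->getCookieJar()->get('JSESSIONID')->getValue()` and `$client->getResponse()->getContent()`. After loading the state, the client can be recreated with the saved session ID as shown above, and if the saved page is needed again, a crawler can be rebuilt from it with `new Crawler($state['html'], $state['uri'])`. The token can be generated with `bin2hex(random_bytes(8))` and handed to the browser, so the second request can find its state file.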