
PHP cURL access to a website behind Cloudflare (2021)


I have been parsing sites for years using curl, but I'm running into something I don't understand with one website. Judging by what it returns, it uses Cloudflare, and from what I have read it has some kind of mechanism that blocks bots but lets regular users through.

What I don't understand is how it is able to do that: it returns a 403 code before anything else has been sent, yet the same request from Chrome works fine.

I have also tested the "Copy as cURL (bash)" command from Chrome's inspector on the command line, with the same result.

This is the code that I'm using:

$headers=array(
    'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-language: es-ES,es;q=0.9',
    'upgrade-insecure-requests: 1',
    //'Referrer Policy: strict-origin-when-cross-origin',
    //'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
    );
    
    $agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36";


$url="https://www.pccomponentes.com/";

//$agent= 'Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
$agent = 'facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)';

$ch = curl_init();
//curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
//curl_setopt($ch, CURLOPT_HEADER, 0);
//curl_setopt($ch, CURLOPT_POST, 0);
//curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
//curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
//curl_setopt($ch, CURLOPT_MAXREDIRS, 20);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
//curl_setopt($ch, CURLOPT_LOW_SPEED_LIMIT, 1); 
//curl_setopt($ch, CURLOPT_LOW_SPEED_TIME, 360); 
//curl_setopt($ch, CURLOPT_IGNORE_CONTENT_LENGTH, 1); 
//curl_setopt($ch, CURLOPT_TCP_NODELAY, 1); 
curl_setopt($ch, CURLOPT_HTTPHEADER,$headers);
curl_setopt($ch, CURLOPT_URL,$url);
$result=curl_exec($ch);
echo "code: ".curl_getinfo($ch,CURLINFO_HTTP_CODE ).PHP_EOL;
//echo $result;

As you can see from the commented-out lines, I have tried a lot of different solutions, user agents, and curl options, but I always get a 403 code.

The equivalent curl command line is:

curl -I -vvv 'https://www.pccomponentes.com/' \
  -H 'authority: www.pccomponentes.com' \
  -H 'sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="90", "Google Chrome";v="90"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'upgrade-insecure-requests: 1' \
  -H 'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
  -H 'sec-fetch-site: none' \
  -H 'sec-fetch-mode: navigate' \
  -H 'sec-fetch-user: ?1' \
  -H 'sec-fetch-dest: document' \
  -H 'accept-language: es-ES,es;q=0.9' \
  --compressed

To check with Google Chrome, I open a private (incognito) window with no cookies at all, open the inspector, and enter the URL.

The output of the script (it is the same with command-line curl) is:

*   Trying 104.16.162.71:443...
* TCP_NODELAY set
* Connected to www.pccomponentes.com (104.16.162.71) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: C=US; ST=CA; L=San Francisco; O=Cloudflare, Inc.; CN=sni.cloudflaressl.com
*  start date: Aug 11 00:00:00 2020 GMT
*  expire date: Aug 11 12:00:00 2021 GMT
*  subjectAltName: host "www.pccomponentes.com" matched cert's "*.pccomponentes.com"
*  issuer: C=US; O=Cloudflare, Inc.; CN=Cloudflare Inc ECC CA-3
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0xaaab008552b0)
> GET /listado/ajax?idShops%5B%5D=0&page=0&order=price-desc&gtmTitle=Tarjetas%20Gr%C3%A1ficas&idFamilies%5B%5D=6 HTTP/2
Host: www.pccomponentes.com
user-agent: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
accept-language: es-ES,es;q=0.9
upgrade-insecure-requests: 1

* old SSL session ID is stale, removing
* Connection state changed (MAX_CONCURRENT_STREAMS == 256)!
< HTTP/2 403 
< date: Sat, 01 May 2021 09:28:32 GMT
< content-type: text/html; charset=UTF-8
< cf-chl-bypass: 1
< set-cookie: __cfduid=db6d6b293bbc3a77fe7f7b90ec55cebc31619861312; expires=Mon, 31-May-21 09:28:32 GMT; path=/; domain=.pccomponentes.com; HttpOnly; SameSite=Lax
< cache-control: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
< expires: Thu, 01 Jan 1970 00:00:01 GMT
< x-frame-options: SAMEORIGIN
< cf-request-id: 09c8db2a8c0000611f910c2000000001
< expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
< server: cloudflare
< cf-ray: 6487faf0d82d611f-BCN
< 
* Connection #0 to host www.pccomponentes.com left intact
code: 403

I have searched for information about:

  • old SSL session ID is stale, removing

but no luck.

What kind of protection is it using? I saw something about JavaScript, but no script has even been loaded when the 403 is returned. I also saw some comments about a CAPTCHA, but that is not possible before anything has been sent. Chrome gets a 200 and curl gets a 403.

I have also tried HTTP/1.1, different encodings, gzip, etc., with no luck at all.

It seems that they changed the system recently.


Solution

  • Cloudflare examines the headers and other properties of each request it receives to decide whether the sender is a bot. If the server does not check them, you can send a request with no extra headers at all and it will work. When they are checked, you should make your request look as much as possible like the one a browser sends.

    A 403 is the default response the first time the page is opened. In the browser, too, the first request gets a 403, but later requests do not, because of the cookies set along the way. You can send the same cookies in your own request.

    To test this, delete the relevant cookie and reload the page. Without the cookie, you will get the 403 and the CAPTCHA again, just like on the first visit.

    Example:

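    // Request options: browser-like headers, follow redirects, return the body.
    $options = [
        CURLOPT_URL => "https://www.pccomponentes.com/",
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow any redirects
        // Certificate checks are disabled in this example; the verbose log in
        // the question shows verification succeeding, so they can be left on.
        CURLOPT_SSL_VERIFYHOST => false,
        CURLOPT_SSL_VERIFYPEER => false,
        CURLOPT_HTTPHEADER => [
            'accept: application/json, text/plain, */*',
            'Accept-Language: en-US,en;q=0.5',
            'x-application-type: WebClient',
            'x-client-version: 2.10.4',
            'Origin: https://www.google.com',
            'user-agent: Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0',
        ]
    ];
    
    $ch = curl_init();
    curl_setopt_array($ch, $options);
    $result = curl_exec($ch);
    // Status code of the final response, as in the question's script.
    echo "code: " . curl_getinfo($ch, CURLINFO_HTTP_CODE) . PHP_EOL;
    curl_close($ch);
    print_r($result);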
    

    The request you send from PHP carries no cookie, so you will always get a 403. You can use CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE to store and resend cookies with curl in PHP.

    https://curl.se/docs/http-cookies.html
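
The cookie handling described above can be sketched roughly as follows. This is only an illustration, not a guaranteed way past the check: the function names `buildCurlOptions` and `fetchWithCookies` and the cookie file path are made up for this example, and whether stored cookies are enough depends on how the site's Cloudflare rules are configured. Only standard PHP cURL options are used.

```php
<?php
// Build the options for a request that loads and saves cookies.
// $cookieFile is a hypothetical path chosen by the caller.
function buildCurlOptions(string $url, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        // Read previously saved cookies from this file (also enables
        // libcurl's cookie engine for the handle).
        CURLOPT_COOKIEFILE     => $cookieFile,
        // Write all known cookies to this file when the handle is released.
        CURLOPT_COOKIEJAR      => $cookieFile,
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0',
        CURLOPT_HTTPHEADER     => [
            'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language: en-US,en;q=0.5',
        ],
    ];
}

// Perform one GET request and return the status code and body.
function fetchWithCookies(string $url, string $cookieFile): array
{
    $ch = curl_init();
    curl_setopt_array($ch, buildCurlOptions($url, $cookieFile));
    $body = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    // Releasing the handle lets libcurl flush the cookies to the jar file.
    unset($ch);
    return ['code' => $code, 'body' => $body];
}
```

Calling `fetchWithCookies()` twice with the same `$cookieFile` makes the second request send back whatever cookies the first response set, which is the behaviour the browser test above relies on. Comparing the `code` value of the two calls shows whether the cookies made a difference for a given site.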