file_get_contents or cURL returns a 404 page instead of the index page


I am trying to fetch the HTML of a website's index page as a string, but I get the external website's 404 page instead.

I have tried both cURL and file_get_contents. Both return the external website's 404 page rather than the content of the index page.

$homepage = file_get_contents("https://www.creditkarma.ca");
echo $homepage;

cURL:

$agent= 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';

function file_get_contents_curl($url) {
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_VERBOSE, true);

    $data = curl_exec($ch);
    curl_close($ch);

    return $data;
}
$homepage = file_get_contents_curl("https://www.creditkarma.ca");
echo $homepage;

The code should return the HTML of the index page, but it returns the external website's 404 page instead. How can I solve this? I need the index page as a string.

Note: the 404 comes from the external website, not from my .htaccess.


Solution

  • When you retrieve a page's HTML with cURL, send browser-like request headers, in particular a User-Agent. As a security precaution, many websites deny traffic (or respond with a 404) when the request carries no browser information, so I set up the request to look as if it came from a browser. In your updated code above, $agent is defined outside file_get_contents_curl(), so it is undefined inside the function and the intended user agent is never applied. Something like this should fit the bill:

    $url = "https://www.creditkarma.ca";
    // Browser-like User-Agent string sent with the request
    $agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // skips certificate verification; avoid in production
    curl_setopt($ch, CURLOPT_VERBOSE, true);         // print connection and header details to stderr
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the response body as a string
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_URL, $url);
    $result = curl_exec($ch);
    var_dump($result);
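
If you want to keep the request inside a function, as in the question, one option is to pass the user agent in as a parameter so it is in scope where curl_setopt needs it. The following is only a minimal sketch: build_curl_options() and fetch_page() are hypothetical names, not part of cURL, and whether the site accepts the request has not been verified here.

```php
<?php
// Hypothetical helper: builds the cURL options so the User-Agent is passed in
// explicitly instead of being read from a variable outside the function scope.
function build_curl_options(string $url, string $agent): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow redirects, as in the question
        CURLOPT_USERAGENT      => $agent, // browser-like User-Agent header
    ];
}

// Hypothetical wrapper: performs the request and returns the HTTP status code,
// the body (false on failure), and any cURL error message.
function fetch_page(string $url, string $agent): array
{
    $ch = curl_init();
    curl_setopt_array($ch, build_curl_options($url, $agent));

    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $error  = curl_error($ch);

    // The handle is released when $ch goes out of scope at the end of the function.
    return ['status' => $status, 'body' => $body, 'error' => $error];
}
```

Calling fetch_page("https://www.creditkarma.ca", $agent) and checking that 'status' is 200 before echoing 'body' makes it easy to tell a real index page from an error page.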
    

    UPDATE

    I tested this as a standalone PHP script and got the following output:

    *   Trying 104.100.143.79:443...
    * TCP_NODELAY set
    * Connected to www.creditkarma.ca (104.100.143.79) port 443 (#0)
    * ALPN, offering h2
    * ALPN, offering http/1.1
    * successfully set certificate verify locations:
    *   CAfile: /etc/ssl/certs/ca-certificates.crt
      CApath: /etc/ssl/certs
    * SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
    * ALPN, server accepted to use http/1.1
    * Server certificate:
    *  subject: businessCategory=Private Organization; jurisdictionC=US; jurisdictionST=Delaware; serialNumber=4313894; C=US; ST=California; L=San Francisco; O=Credit Karma Inc.; CN=www.creditkarma.ca
    *  start date: Mar 16 00:00:00 2020 GMT
    *  expire date: Mar 21 12:00:00 2022 GMT
    *  subjectAltName: host "www.creditkarma.ca" matched cert's "www.creditkarma.ca"
    *  issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=DigiCert SHA2 Extended Validation Server CA
    *  SSL certificate verify ok.
    > GET / HTTP/1.1
    Host: www.creditkarma.ca
    User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)
    Accept: */*
    
    * old SSL session ID is stale, removing
    * Mark bundle as not supporting multiuse
    < HTTP/1.1 200 OK
    < Content-Type: text/html; charset=utf-8
    < x-content-security-policy:
    < Server: CK-FG-server
    < Strict-Transport-Security: max-age=31536000; includeSubdomains; preload
    < X-Frame-Options: SAMEORIGIN
    < X-XSS-Protection: 1; mode=block
    < ORIGIN-ENV: production
    < ORIGIN-DC: us-east4
    < Expires: Wed, 12 Jan 2022 18:20:46 GMT
    < Cache-Control: max-age=0, no-cache, no-store
    < Pragma: no-cache
    < Date: Wed, 12 Jan 2022 18:20:46 GMT
    < Transfer-Encoding:  chunked
    < Connection: keep-alive
    < Connection: Transfer-Encoding
    < Set-Cookie: ck_cabf=IjA5MTRmMDQ2LTE3OTAtNDQ5MC1hODA3LWUzZTRlZDcwYTdlYSI=; Max-Age=31536000; Expires=Thu, 12 Jan 2023 18:20:46 GMT; Secure; SameSite=Strict; Path=/
    < Set-Cookie: ck_crumb=6da1442eb87cee1a6c0c08c56a9b07826949e3dc130925b0fcb774a83d566b71f5a9b634c4e4f198ae8dc4a6722abf41; Secure; HttpOnly; SameSite=Strict; Path=/
    < Set-Cookie: ck_trace_id=5544f4ea-9d03-462b-ab5f-8a81c70c6c81; HttpOnly; SameSite=Strict; Path=/
    < Set-Cookie: ck_lang=en; SameSite=Strict; Path=/
    <
    * Connection #0 to host www.creditkarma.ca left intact
    string(63139) "<!DOCTYPE html>
    <html>
        <head>
     ..... Rest of page here
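
The question also tried file_get_contents, which can send a User-Agent as well if you give it a stream context. This is a sketch under the same assumption that the missing browser information is what triggers the 404. The helper names build_http_context() and fetch_with_context() are hypothetical, and the result against this particular site is untested.

```php
<?php
// Hypothetical helper: creates a stream context that sends a User-Agent header.
// The 'http' options are also used for https:// URLs.
function build_http_context(string $agent)
{
    return stream_context_create([
        'http' => [
            'method'        => 'GET',
            'user_agent'    => $agent, // sent as the User-Agent request header
            'ignore_errors' => true,   // still return the body on 4xx/5xx responses
        ],
    ]);
}

// Hypothetical wrapper: fetches the URL with the context above.
// Returns the response body as a string, or false on failure.
function fetch_with_context(string $url, string $agent)
{
    return file_get_contents($url, false, build_http_context($agent));
}
```

With this in place, fetch_with_context("https://www.creditkarma.ca", $agent) would stand in for the plain file_get_contents call from the question.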