Tags: c#, web-scraping, dotnet-httpclient, incapsula

HttpClient - Different content returned than browser


I'm trying to make a request to kicksusa.com. If I make the request from any browser, I get the full expected HTML. However, I cannot seem to simulate the request in a way that returns the same HTML; instead I get a 'Request unsuccessful.' message.

Any help is appreciated

My code:

HttpClientHandler httpClientHandler = new HttpClientHandler()
{
    //Proxy = proxy,
    AllowAutoRedirect = true,
    MaxAutomaticRedirections = 15,
    AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
};

// Pass the configured handler so the redirect and decompression settings actually take effect
var client = new HttpClient(httpClientHandler);
client.DefaultRequestHeaders.Add("Host", "www.kicksusa.com");
client.DefaultRequestHeaders.Add("Connection", "keep-alive");
client.DefaultRequestHeaders.Add("Upgrade-Insecure-Requests", "1");
client.DefaultRequestHeaders.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.87 Safari/537.36");
client.DefaultRequestHeaders.Add("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
client.DefaultRequestHeaders.Add("Accept-Encoding", "gzip, deflate, sdch");
client.DefaultRequestHeaders.Add("Accept-Language", "en-GB,en-US;q=0.8,en;q=0.6");


var _response = await client.GetAsync("http://www.kicksusa.com/jordan-craig/oil-stain-slub-tee-army-green-8909ag.html");

if (_response.IsSuccessStatusCode)
{
    var _html = await _response.Content.ReadAsStringAsync();
}

Fiddler trace headers:

Host: www.kicksusa.com
Connection: keep-alive
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.87 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-GB,en-US;q=0.8,en;q=0.6

Solution

  • This website uses a dedicated bot-protection technology from Incapsula to prevent automated access to its content.

    On the first request, the site returns a web document with an embedded iframe. Only when the iframe source is loaded does the server set a cookie and redirect to the requested page. All further requests then succeed immediately because the browser sends the cookie along with them.

    In order to circumvent the mechanism, you would have to load the iframe after the first request, remember the cookie, and then send the cookie with all further requests (see the sketch after this answer for how cookie persistence works with HttpClient). There is also a lot of JavaScript involved in that first response which would probably have to be executed for the Incapsula check to succeed.

    However, when a site specifically uses such a technology to prevent automated access to its content, any attempt to circumvent that mechanism must be considered undesired and may even constitute a criminal act. You should not automatically gather data from a site without its owner's approval, especially not when a technology such as Incapsula is used to make this more difficult.

    See also this answer by an Incapsula employee for more details.
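
For completeness, here is a minimal sketch of the cookie-persistence part described above, using the standard HttpClientHandler.CookieContainer API. It only shows how cookies set by one response are automatically resent on later requests from the same client; it will not by itself satisfy the Incapsula check, which also depends on executing the JavaScript in the challenge page. The URLs are placeholders taken from the question.

using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

class CookiePersistenceSketch
{
    static async Task Main()
    {
        // A shared CookieContainer stores any Set-Cookie headers from responses
        // and attaches them to subsequent requests to the same host.
        var cookies = new CookieContainer();

        var handler = new HttpClientHandler
        {
            CookieContainer = cookies,
            UseCookies = true,
            AllowAutoRedirect = true,
            AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
        };

        using (var client = new HttpClient(handler))
        {
            // First request: the server may return a challenge page and set cookies.
            var first = await client.GetAsync("http://www.kicksusa.com/");
            Console.WriteLine($"First request: {(int)first.StatusCode}");

            // Any cookies from the first response are now in the container and are
            // sent automatically with further requests from this client instance.
            var second = await client.GetAsync("http://www.kicksusa.com/");
            Console.WriteLine($"Second request: {(int)second.StatusCode}");
        }
    }
}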