c# html-agility-pack webrequest

Can't download a web page in .NET


I wrote a batch job that parses HTML pages from gearbest.com to extract item data (example link). It worked until 2-3 weeks ago, when the site was updated. Since then I can't download the pages to parse, and I don't understand why. Before the update I made the request with the following HtmlAgilityPack code:

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = null;
doc = web.Load(url); // this is now where the exception is thrown

I also tried without the library, using HttpWebRequest directly and adding some headers to the request:

HttpWebRequest request = (HttpWebRequest) WebRequest.Create("https://it.gearbest.com/tv-box/pp_009940949913.html");
request.Credentials = CredentialCache.DefaultCredentials;
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36";
request.ContentType = "text/html; charset=UTF-8";
request.CookieContainer = new CookieContainer();
request.Headers.Add("accept-language", "it-IT,it;q=0.9,en-US;q=0.8,en;q=0.7");
request.Headers.Add("accept-encoding", "gzip, deflate, br");
request.Headers.Add("upgrade-insecure-requests", "1");
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8";

WebResponse response = request.GetResponse();  // the exception is thrown here

The exception is:

  • IOException: Unable to read data from the transport connection
  • SocketException: The connection could not be established.
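The two messages above look like an outer exception and its inner cause. A minimal sketch for printing the whole chain, which can help when comparing what different header combinations produce, might look like this. The `RequestDiagnostics` class and its method names are hypothetical and not part of the original code.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical helper, for illustration only.
static class RequestDiagnostics
{
    // Walks the InnerException chain and returns one "Type: message" line per level.
    public static string DescribeChain(Exception ex)
    {
        var sb = new StringBuilder();
        for (Exception e = ex; e != null; e = e.InnerException)
        {
            sb.AppendLine(e.GetType().Name + ": " + e.Message);
        }
        return sb.ToString();
    }

    // Sends the request and reports either the HTTP status code or the failure chain.
    public static string Probe(HttpWebRequest request)
    {
        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                return "OK: " + (int)response.StatusCode;
            }
        }
        catch (WebException ex)
        {
            // ex.Status distinguishes, for example, a connection failure from a protocol error.
            return "WebException status " + ex.Status + Environment.NewLine + DescribeChain(ex);
        }
    }
}
```

Calling `RequestDiagnostics.Probe(request)` in place of `request.GetResponse()` returns a short report instead of throwing.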

If I try to request the main page (https://it.gearbest.com) it works.

What's the problem in your opinion?


Solution

  • For some reason the site doesn't like the provided user agent. If you omit setting UserAgent, everything works fine:

    HttpWebRequest request = (HttpWebRequest) WebRequest.Create("https://it.gearbest.com/tv-box/pp_009940949913.html");
    request.Credentials = CredentialCache.DefaultCredentials;
    //request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36";
    request.ContentType = "text/html; charset=UTF-8";
    

    Another solution would be to set request.Connection to an arbitrary string (but not keep-alive or close, which HttpWebRequest rejects with an ArgumentException):

    request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36";
    request.Connection = "random value";
    

    This also works, but I cannot explain why.
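Both workarounds can be folded into one small helper so that they are easy to switch between. This is only a sketch under assumptions: the `PageFetcher` class and its parameter names are hypothetical, and whether either variant keeps working depends on the site's server-side behavior, which is not documented here.

```csharp
using System;
using System.IO;
using System.Net;

// Hypothetical helper, for illustration only.
static class PageFetcher
{
    // Builds a GET request.
    // Pass null for userAgent to leave the User-Agent header unset (first workaround).
    // Pass a non-null connectionToken to send a custom Connection header (second workaround).
    public static HttpWebRequest CreateRequest(string url, string userAgent, string connectionToken)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = new CookieContainer();

        if (userAgent != null)
        {
            request.UserAgent = userAgent;
        }

        if (connectionToken != null)
        {
            // The setter throws ArgumentException for "keep-alive" or "close".
            request.Connection = connectionToken;
        }

        return request;
    }

    // Sends the request and returns the response body as text.
    public static string DownloadString(HttpWebRequest request)
    {
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

The string returned by `PageFetcher.DownloadString(...)` can then be passed to HtmlAgilityPack's `HtmlDocument.LoadHtml` instead of calling `HtmlWeb.Load(url)`, so the parsing code from the question stays unchanged.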