c# .net webclient

Get HTML from WebClient


I am trying to pull down the HTML from this URL programmatically: https://www.parkrun.com.au/lewisparkreserve/results/latestresults/, but the site detects that I am not a browser.

My first attempt was just this, which returned a 403 Forbidden:

WebClient _client = new WebClient();
string _html = _client.DownloadString("https://www.parkrun.com.au/lewisparkreserve/results/latestresults/");

My second attempt, after some googling, involved setting some headers, including the user agent:

ServicePointManager.Expect100Continue = true;
ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
WebClient _client = new WebClient();
_client.Headers[HttpRequestHeader.Accept] = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7";
_client.Headers[HttpRequestHeader.AcceptEncoding] = "gzip, deflate, br";
_client.Headers[HttpRequestHeader.AcceptLanguage] = "en-GB,en;q=0.9,en-US;q=0.8";
_client.Headers[HttpRequestHeader.CacheControl] = "max-age=0";
_client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0";

This second attempt avoids the 403 error, but instead of the HTML I would get by opening this URL in a browser, I get HTML asking me to prove I am not a robot. I am only making a one-off call, so it can't be detecting too many requests from my IP. I assume I am still missing something, possibly in the HTTP headers.


Solution

  • Firstly, WebClient is obsolete now (its constructors are marked obsolete as of .NET 6). I would recommend using HttpClient instead.

    I tried the following code, and it worked. Let me know how it goes:

    using System;
    using System.Net.Http;

    // Send a browser-like User-Agent; the header-less request in the question got a 403.
    HttpClient _client = new HttpClient();
    _client.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537");
    // .Result blocks the calling thread until the request completes.
    var response = _client.GetAsync("https://www.parkrun.com.au/lewisparkreserve/results/latestresults/").Result;
    string _html = response.Content.ReadAsStringAsync().Result;
    Console.WriteLine(_html);
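The snippet above blocks on .Result. As a minimal sketch of the same idea with async/await, the variant below reuses the User-Agent string from the answer and additionally turns on automatic gzip/deflate decompression, on the assumption that the server may send compressed responses. The helper names CreateBrowserLikeClient and GetHtmlAsync are hypothetical, and nothing here guarantees that the site's bot check will be satisfied.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper: builds an HttpClient that sends the given User-Agent
// and lets the handler decompress gzip/deflate responses (an assumption, not
// something the original answer needed).
static HttpClient CreateBrowserLikeClient(string userAgent)
{
    var handler = new HttpClientHandler
    {
        AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
    };
    var client = new HttpClient(handler);
    client.DefaultRequestHeaders.UserAgent.ParseAdd(userAgent);
    return client;
}

// Hypothetical helper: fetches a page without blocking the calling thread.
// Throws HttpRequestException on a non-success status such as 403.
static async Task<string> GetHtmlAsync(HttpClient client, string url)
{
    using var response = await client.GetAsync(url);
    response.EnsureSuccessStatusCode();
    return await response.Content.ReadAsStringAsync();
}

// User-Agent string copied verbatim from the answer above.
var client = CreateBrowserLikeClient(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537");

// Show the header that will be sent with every request.
Console.WriteLine(client.DefaultRequestHeaders.UserAgent.ToString());

// Only hit the network when a URL is passed on the command line.
if (args.Length > 0)
{
    Console.WriteLine(await GetHtmlAsync(client, args[0]));
}
```

Running it with the URL from the question as the first argument performs the request. A single HttpClient instance is meant to be reused across calls, which is why the client is created once and passed into GetHtmlAsync.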