
C# WebClient receives 403 when getting HTML from a site


I am trying to download the HTML from a site and parse it. I am only interested in the OpenGraph data in the head section. For most sites, WebClient, HttpClient, or HtmlAgilityPack works, but for some domains I get a 403 response, for example westelm.com.

I have tried setting the request headers to exactly match what my browser sends, but I still get a 403. Here is the code:

string url = "https://www.westelm.com/m/products/brushed-herringbone-throw-t5792/?";

var doc = new HtmlDocument();

using(WebClient client = new WebClient()) {
  client.Headers["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36";
  client.Headers["Accept"] = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9";
  client.Headers["Accept-Encoding"] = "gzip, deflate, br";
  client.Headers["Accept-Language"] = "en-US,en;q=0.9";
  doc.Load(client.OpenRead(url));
}

At this point, I am getting a 403.

Am I missing something, or is the site administrator protecting the site from API requests?

How can I make this work? Is there a better way to get OpenGraph data from a site?

Thanks.


Solution

  • I used your question to resolve the same problem. I don't know whether you have already fixed this, but here is how it worked for me.

    A page was giving me a 403 for the same reason. The thing is, you need to emulate a web browser from your code by sending the headers a browser would send.

    I added one of your headers that I wasn't sending before (Accept-Language).

    I didn't use WebClient, though. I used HttpClient to download the page:

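    // Requires: System, System.IO, System.IO.Compression, System.Linq,
    // System.Net.Http, System.Threading.Tasks
    private static async Task<string> GetHtmlResponseAsync(HttpClient httpClient, string url)
        {
            using var request = new HttpRequestMessage(HttpMethod.Get, new Uri(url));
            // Mimic the headers a desktop Chrome browser sends.
            request.Headers.TryAddWithoutValidation("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9");
            // Advertise only the encodings this method can actually decode.
            request.Headers.TryAddWithoutValidation("Accept-Encoding", "gzip, br");
            request.Headers.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36");
            request.Headers.TryAddWithoutValidation("Accept-Charset", "UTF-8");
            request.Headers.TryAddWithoutValidation("Accept-Language", "en-US,en;q=0.9");
    
            using var response = await httpClient.SendAsync(request).ConfigureAwait(false);
    
            // SendAsync never returns null; failures such as 403 show up in the status code.
            if (!response.IsSuccessStatusCode)
                return string.Empty;
    
            using var responseStream = await response.Content.ReadAsStreamAsync().ConfigureAwait(false);
            // Pick the decoder from Content-Encoding instead of assuming gzip.
            var encoding = response.Content.Headers.ContentEncoding.LastOrDefault() ?? string.Empty;
            using Stream decodedStream = encoding.ToLowerInvariant() switch
            {
                "gzip" => new GZipStream(responseStream, CompressionMode.Decompress),
                "br" => new BrotliStream(responseStream, CompressionMode.Decompress),
                "" => responseStream, // not compressed, or already decompressed by the handler
                _ => throw new NotSupportedException($"Unsupported Content-Encoding: {encoding}")
            };
            using var streamReader = new StreamReader(decodedStream);
            return await streamReader.ReadToEndAsync().ConfigureAwait(false);
        }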
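
    As an alternative to decompressing the body by hand, HttpClientHandler can do it for you when AutomaticDecompression is set. The sketch below is one way to wire that up. The BrowserLikeHttp class and its method names are made up for illustration, and the header values are simply the ones from the code above (with a shortened Accept list). With this setup the handler adds Accept-Encoding itself, so it is not set manually. Browser-like headers are not guaranteed to get past every 403.

    ```csharp
    using System.Net;
    using System.Net.Http;

    // Hypothetical helper for illustration; not part of any library.
    public static class BrowserLikeHttp
    {
        // The handler decompresses gzip/deflate responses before the caller sees them.
        public static HttpClientHandler CreateHandler() => new HttpClientHandler
        {
            AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
        };

        // Default headers are sent with every request made through this client.
        public static HttpClient CreateClient(HttpMessageHandler handler)
        {
            var client = new HttpClient(handler);
            var headers = client.DefaultRequestHeaders;
            headers.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36");
            headers.TryAddWithoutValidation("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
            headers.TryAddWithoutValidation("Accept-Language", "en-US,en;q=0.9");
            return client;
        }
    }
    ```

    Usage would look something like `using var client = BrowserLikeHttp.CreateClient(BrowserLikeHttp.CreateHandler());` followed by `string html = await client.GetStringAsync(url);`. Keep in mind that GetStringAsync throws an HttpRequestException on a non-success status such as 403, so wrap the call in a try/catch if you want to handle that case.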
    

    If it helps you, I'm glad. If not, I will leave this answer here to help someone else in the future!
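
    For the OpenGraph part of the question: once the HTML is in a string, HtmlAgilityPack (which the question already uses) can pull out the og: meta tags with an XPath query. Here is a minimal sketch, assuming the tags use the usual meta property="og:..." content="..." form. OpenGraphReader and Parse are names invented for this example.

    ```csharp
    using System.Collections.Generic;
    using HtmlAgilityPack; // NuGet package, the same library the question uses

    // Hypothetical helper for illustration; not part of HtmlAgilityPack.
    public static class OpenGraphReader
    {
        // Returns a map such as "og:title" -> "...", keeping the first value seen per property.
        public static Dictionary<string, string> Parse(string html)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            var result = new Dictionary<string, string>();
            var nodes = doc.DocumentNode.SelectNodes("//meta[starts-with(@property, 'og:')]");
            if (nodes == null)
                return result; // SelectNodes returns null when nothing matches

            foreach (var node in nodes)
            {
                string property = node.GetAttributeValue("property", string.Empty);
                string content = node.GetAttributeValue("content", string.Empty);
                if (!result.ContainsKey(property))
                    result[property] = content;
            }
            return result;
        }
    }
    ```

    You would call it with the string returned by GetHtmlResponseAsync, for example `var og = OpenGraphReader.Parse(html);`, and then look up keys such as "og:title" or "og:image" if the page provides them.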