c# web-scraping httpclient webclient

WebClient returns 403 when calling DownloadFile()


I am starting to write a web scraper and wanted to scrape something I might only use once. As an example, I want to download this image (https://thebarchive.com/b/full_image/1707085883033680.jpg) using WebClient's DownloadFile function, but it just returns error 403. As you can see below, I am adding a ton of headers. I have already thrown some of them out, but I tried to copy most of the headers my browser sends when I open the image normally (I found them with Fiddler). I am getting desperate; maybe someone can help me figure out what the problem is. An important note: just yesterday both httpClient and WebClient worked fine without any added headers, but today both refuse to work.

            wc.Headers.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36");
            wc.Headers.Add("Host", "thebarchive.com");
            wc.Headers.Add("Content-Type","application/x-www-form-urlencoded");
            wc.Headers.Add("Cache-Control","max-age=0");
            wc.Headers.Add("Content-Length","0");
            wc.Headers.Add("origin", "thebarchive.com");
            wc.Headers.Add("upgrade-insecure-requests","1");
            wc.Headers.Add("accept-encoding","gzip, deflate, br");
            wc.Headers.Add("cookie","cf_chl_3=0e3bd5e051a1c24");


            // I pass both wc and httpClient into the method that uses this code to download the pictures.
            // The code below just works its way to the image links. It works perfectly fine; the only problem is downloading the picture.


            var html = httpClient.GetStringAsync(ThreadLink).Result;
            var htmlDocument = new HtmlDocument();
            htmlDocument.LoadHtml(html);
            var piclinks = htmlDocument.DocumentNode.Descendants("div")
                .Where(node => node.GetAttributeValue("class", "")
                .Contains("thread_image_box")).ToList();

            foreach(var imagelink in piclinks)
            {
                string link = imagelink.InnerHtml;

                // Extract the picture's file name, because the link is somehow stored not as a direct thebarchive link
                // but as an archived.moe reference to that link,
                // which is not even how it is stored on the website if you inspect the picture with Ctrl+Shift+C.

                string linkfull = link.Substring(link.IndexOf("t/")+2,link.IndexOf("target=")-2-(link.IndexOf("t/")+2));
                string downloadlink = "https://thebarchive.com/b/full_image/" + linkfull;

                try 
                {
                    wc.DownloadFile(downloadlink, Path.Combine(folder,linkfull));
                }
                catch (Exception ex)
                {
                    Console.WriteLine("Couldn't load file, most likely a video");
                    Console.WriteLine(ex);
                }
            }

What I tried: adding headers. That was enough for httpClient to fetch the thread page and get me the image link, but it made zero progress on downloading the picture itself.


Solution

  • The problem wasn't with the headers at all. The website had suddenly been put behind Cloudflare anti-bot protection. To bypass it, FlareSolverrSharp and a private proxy were used.

    FlareSolverrSharp works through HttpClient, so wc.DownloadFile() was replaced with an HttpClient download:

    wc.DownloadFile(downloadlink, Path.Combine(folder,linkfull));
    
    // became
    
    byte[] imageBytes = await httpClient.GetByteArrayAsync(downloadlink);
    File.WriteAllBytes(Path.Combine(folder,linkfull), imageBytes);
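
Since the 403 came from Cloudflare rather than from missing headers, it can help to check for that case explicitly before tweaking requests. Below is a minimal sketch, not the code actually used above. It assumes Cloudflare marks its challenge pages with a `cf-mitigated: challenge` response header; `ImageDownloader`, `LooksLikeCloudflareChallenge` and `TryDownloadAsync` are hypothetical names introduced only for illustration.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class ImageDownloader
{
    // Hypothetical helper: returns true when the response carries the
    // "cf-mitigated: challenge" header (assumed Cloudflare challenge marker).
    public static bool LooksLikeCloudflareChallenge(HttpResponseMessage response)
    {
        return response.Headers.TryGetValues("cf-mitigated", out var values)
            && values.Any(v => v.Equals("challenge", StringComparison.OrdinalIgnoreCase));
    }

    // Downloads url to path with the supplied HttpClient.
    // Returns false (and logs the reason) instead of throwing on an HTTP error.
    public static async Task<bool> TryDownloadAsync(HttpClient httpClient, string url, string path)
    {
        using (HttpResponseMessage response = await httpClient.GetAsync(url))
        {
            if (LooksLikeCloudflareChallenge(response))
            {
                Console.WriteLine($"Cloudflare challenge for {url} (HTTP {(int)response.StatusCode}); headers alone will not fix this.");
                return false;
            }

            if (!response.IsSuccessStatusCode)
            {
                Console.WriteLine($"Download failed for {url}: HTTP {(int)response.StatusCode}");
                return false;
            }

            byte[] imageBytes = await response.Content.ReadAsByteArrayAsync();
            File.WriteAllBytes(path, imageBytes);
            return true;
        }
    }
}
```

Inside the loop this would be called as `await ImageDownloader.TryDownloadAsync(httpClient, downloadlink, Path.Combine(folder, linkfull));`, with `httpClient` being whichever client is in use, including one routed through FlareSolverrSharp. The helper only reports the challenge; it does not solve it.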