c# html http-redirect httpwebrequest html-agility-pack

How to retrieve HTML Page without getting redirected?


I want to scrape the HTML of a website. When I open it in a browser (Chrome or Firefox), the page and its HTML load without any problem.

When I try to fetch and parse the HTML in C# with HttpWebRequest and HtmlAgilityPack, the website redirects me to another site, so I end up parsing the HTML of the redirect target instead.

Any idea how to solve this problem?

I thought the site detects that the request comes from a program and redirects it right away, so I tried Selenium with ChromeDriver and FirefoxDriver. No luck there either: I still get redirected immediately.

The Website: https://www.jodel.city/7700#!home

private void bt_load_Click(object sender, EventArgs e)
{
        // Needs: using System; using System.IO; using System.Net;
        var url = @"https://www.jodel.city/7700#!home";
        var req = (HttpWebRequest)WebRequest.Create(url);
        req.AllowAutoRedirect = false; // do not follow the redirect automatically
        // req.Referer = "http://www.muenchen.de/";

        // Dispose the response and the reader once the body has been read.
        using (var resp = req.GetResponse())
        using (var sr = new StreamReader(resp.GetResponseStream()))
        {
                string returnedContent = sr.ReadToEnd();
                Console.WriteLine(returnedContent);
        }
}

Solution

  • And of course, cookies are to blame again, because cookies are great and amazing.

    So, let's look at what happens in Chrome the first time you visit the site:

    (I went to https://www.jodel.city/7700#!home):

    [Screenshot: the first request in Chrome, answered with a 302 that sets the __cfduid cookie]

    Yes, I got a 302 redirect, but the server also told me to set a __cfduid cookie (twice, actually).

    When you visit the site again, you are correctly let into the site:

    [Screenshot: the second request in Chrome, sent with the __cfduid cookie]

    Notice how this time a __cfduid cookie was sent along? That's the key here.

    Your C# code needs to:

    1. Go to the site once, get redirected, and read the cookie value from the Set-Cookie response header.
    2. Go back to the site with that cookie value in the Cookie request header.

    You can go to the first link in this post to see an example of how to set cookie values for requests.
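
The two steps above can be sketched roughly as follows. This is only a minimal sketch, not a tested solution for this particular site: it assumes the server lets the request through as soon as the cookie from the first response is sent back, and the names CookieAwareFetcher, FetchHtml and maxAttempts are made up for illustration. It stays with HttpWebRequest, as in the question, and relies on a shared CookieContainer to store whatever the Set-Cookie headers deliver and to send it back on the next request.

```csharp
using System;
using System.IO;
using System.Net;

public static class CookieAwareFetcher
{
    // Hypothetical helper: requests the URL up to maxAttempts times with one
    // shared CookieContainer and returns the body of the first response that
    // is not a redirect.
    public static string FetchHtml(string url, int maxAttempts = 2)
    {
        // Collects cookies from Set-Cookie headers and replays them later.
        var cookies = new CookieContainer();

        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            var req = (HttpWebRequest)WebRequest.Create(url);
            req.AllowAutoRedirect = false; // stay on the requested URL
            req.CookieContainer = cookies; // send stored cookies, store new ones

            using (var resp = (HttpWebResponse)req.GetResponse())
            {
                int status = (int)resp.StatusCode;
                if (status >= 300 && status < 400)
                {
                    // Redirected. Any cookies from this response are now in
                    // the container, so the next attempt sends them along.
                    continue;
                }

                using (var reader = new StreamReader(resp.GetResponseStream()))
                {
                    return reader.ReadToEnd();
                }
            }
        }

        throw new InvalidOperationException(
            "Still redirected after " + maxAttempts + " attempts.");
    }
}
```

In the click handler from the question, this could be called as `Console.WriteLine(CookieAwareFetcher.FetchHtml("https://www.jodel.city/7700"));`. If the site checks more than the cookie (for example the User-Agent header), the sketch will keep getting redirected and throw after the last attempt.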