c# drupal httpwebrequest screen-scraping

Grab the contents of a Drupal website that is secured with a login form


I would like to grab some content from a website built with Drupal. The challenge is that I need to log in to the site before I can access the page I want to scrape. Is there a way to automate the login in my C# code so that I can grab the secured content?


Solution

  • To access the secured content, you need to store cookies and send them with every request to the server. Start with the request that submits your login details, and keep the session cookie the server returns: it is your proof that you are who you say you are.

    System.Windows.Forms.WebBrowser is an out-of-the-box option that handles cookies for you, at the cost of less control.

    My preferred method is to use System.Net.HttpWebRequest to send and receive all web data, and then use the HtmlAgilityPack to parse the returned HTML into a Document Object Model (DOM) that is easy to read from.
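
    As a rough sketch of the parsing step, the helper below uses the HtmlAgilityPack to collect the hidden inputs of a page, which a login form often requires you to send back unchanged. The class and method names (FormReader, ReadHiddenFields) are made up for illustration, and the code assumes the HtmlAgilityPack NuGet package is referenced.

    ```csharp
    using System;
    using System.Collections.Generic;
    using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

    public static class FormReader
    {
        // Hypothetical helper: returns name/value pairs for every <input type="hidden"> in the HTML.
        public static Dictionary<string, string> ReadHiddenFields(string html)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            var fields = new Dictionary<string, string>();

            // SelectNodes returns null (not an empty collection) when nothing matches.
            HtmlNodeCollection inputs = doc.DocumentNode.SelectNodes("//input[@type='hidden']");
            if (inputs == null)
                return fields;

            foreach (HtmlNode input in inputs)
            {
                string name = input.GetAttributeValue("name", "");
                if (name.Length > 0)
                    fields[name] = input.GetAttributeValue("value", "");
            }
            return fields;
        }
    }
    ```

    You would pass it the HTML read from the login page response and merge the result with the user name and password fields before submitting the form.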

    The trick to getting System.Net.HttpWebRequest to work is to create one long-lived System.Net.CookieContainer that keeps track of your login session (and anything else the server expects you to send back). The good news is that HttpWebRequest takes care of all of this for you once you give it the container.

    You need a new HttpWebRequest for each call you make, so you must set each request's .CookieContainer to the same object every time. Here is an example:

    UNTESTED

    using System.Net;

    public void TestConnect()
    {
        // One container shared by every request, so the session cookie survives between calls.
        CookieContainer cookieJar = new CookieContainer();

        // 1. Load the login page; the server may already set cookies here.
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.mysite.com/login.htm");
        request.CookieContainer = cookieJar;
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        // do page parsing and request setting here
        response.Close();

        // 2. Submit the login form.
        request = (HttpWebRequest)WebRequest.Create("http://www.mysite.com/submit_login.htm");
        // add specific page parameters here (method, content type, user name, password)
        request.CookieContainer = cookieJar;
        response = (HttpWebResponse)request.GetResponse();
        response.Close();

        // 3. Request the secured page.
        request = (HttpWebRequest)WebRequest.Create("http://www.mysite.com/secured_page.htm");
        request.CookieContainer = cookieJar;
        // this will now work since you have saved your authentication cookies in 'cookieJar'
        response = (HttpWebResponse)request.GetResponse();
        // read response.GetResponseStream() here to get the page content
        response.Close();
    }
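
    The "add specific page parameters here" step usually means sending the form fields as a URL-encoded POST body. The sketch below shows one way to do that with a shared CookieContainer. FormPoster, BuildFormBody and PostForm are hypothetical names, and the field names depend on the site: a stock Drupal 7 login form, for example, posts name, pass, form_id and form_build_id, but treat that as an assumption and check the HTML of the actual login page.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Net;
    using System.Text;

    public static class FormPoster
    {
        // Encodes name/value pairs as an application/x-www-form-urlencoded body.
        public static string BuildFormBody(IEnumerable<KeyValuePair<string, string>> fields)
        {
            var parts = new List<string>();
            foreach (KeyValuePair<string, string> field in fields)
                parts.Add(Uri.EscapeDataString(field.Key) + "=" + Uri.EscapeDataString(field.Value));
            return string.Join("&", parts);
        }

        // POSTs the fields to 'url' with the shared cookie container and returns the response body.
        public static string PostForm(string url,
                                      IEnumerable<KeyValuePair<string, string>> fields,
                                      CookieContainer cookieJar)
        {
            byte[] body = Encoding.UTF8.GetBytes(BuildFormBody(fields));

            var request = (HttpWebRequest)WebRequest.Create(url);
            request.Method = "POST";
            request.ContentType = "application/x-www-form-urlencoded";
            request.ContentLength = body.Length;
            request.CookieContainer = cookieJar; // same container as every other request

            using (Stream stream = request.GetRequestStream())
                stream.Write(body, 0, body.Length);

            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
                return reader.ReadToEnd();
        }
    }
    ```

    After a successful PostForm call, any session cookie the server issued is stored in cookieJar, so a later request for the secured page that uses the same container is sent with it.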
    

    WebBrowser Class: http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser.aspx

    HttpWebRequest Class

    HttpWebRequest.CookieContainer Property: http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.cookiecontainer.aspx