I would like to grab some content from a website that is made with Drupal. The challenge here is that i need to login on this site before i can access the page i want to scrape. Is there a way to automate this login process in my C# code, so i can grab the secure content?
To access the secured content, you'll need to store and send cookies with every request to your server, starting with the request that sends your log in info and then saving the session cookie that the server gives you (which is your proof that you are who you say you are).
You can use the System.Windows.Forms.WebBrowser
for a less control but out-of-the-box solution that will handle cookies.
My preferred method is to use System.Net.HttpWebRequest
to send and receive all web data and then use the HtmlAgilityPack to parse the returned data into a Document Object Model (DOM) which can be easily read from.
The trick to getting System.Net.HttpWebRequest
to work is that you must create a long-lived System.Net.CookieContainer
that will keep track of your log in info (and other things the server expects you to keep track of). The good news is that the HttpWebRequest
will take care of all of this for you if you provide the container.
You need a new HttpWebRequest
for each call you make, so you must sets their .CookieContainer
to the same object every time. Here is an example:
using System.Net;
public void TestConnect()
{
CookieContainer cookieJar = new CookieContainer();
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.mysite.com/login.htm");
request.CookieContainer = cookieJar;
HttpWebResponse response = (HttpWebResponse) request.GetResponse();
// do page parsing and request setting here
request = (HttpWebRequest)WebRequest.Create("http://www.mysite.com/submit_login.htm");
// add specific page parameters here
request.CookeContainer = cookieJar;
response = (HttpWebResponse) request.GetResponse();
request = (HttpWebRequest)WebRequest.Create("http://www.mysite.com/secured_page.htm");
request.CookeContainer = cookieJar;
// this will now work since you have saved your authentication cookies in 'cookieJar'
response = (HttpWebResponse) request.GetResponse();
}
http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser.aspx
http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.cookiecontainer.aspx