
ASP.NET Screen Scrape Post Simulate


I'm trying to download and parse the HTML of a web page. Recently, the source website moved from having all of its information on one page to hiding part of it behind JavaScript. There's a "Show All" check box that needs to be activated in order to view the whole page.

Here's the website: http://services.cftc.gov/sirt/sirt.aspx?Topic=ForeignPart30Exemptions

Essentially, I'm looking to automate retrieving that page after the check box has been clicked. Currently, we have a C program that downloads the web page and handles our parsing. I'm not sure whether it can accept JavaScript in the URL, or whether that could even solve this problem (I tried using a bookmarklet to call the JavaScript from the URL, but I couldn't get it to handle the check box). The C program can also read from files, though, so writing a separate C# program to fetch the page is an option if that's easier.

I would prefer a way to code this myself rather than use a third-party program, to avoid having to install anything on the server this runs on. Any help is greatly appreciated.


Edit: Basically, how can I automate the call to the JavaScript that is linked to that "Show All" check box, so I can grab the HTML page containing everything that's displayed after clicking it?


Edit 2: Here's the output from Fiddler2:

__EVENTTARGET ctl00$ContentPlaceHolder1$GenericWebUserControl$ShowAllCheckBox
__EVENTARGUMENT
__LASTFOCUS
__VIEWSTATE (REMOVED DUE TO LENGTH)
__EVENTVALIDATION (REMOVED DUE TO LENGTH)
ctl00$ContentPlaceHolder1$GenericWebUserControl$Organization0 ALL
ctl00$ContentPlaceHolder1$GenericWebUserControl$Initial or Amendment1 ALL
ctl00$ContentPlaceHolder1$GenericWebUserControl$Relief Requested2 ALL
ctl00$ContentPlaceHolder1$GenericWebUserControl$Country3 ALL
ctl00$ContentPlaceHolder1$GenericWebUserControl$Status4 ALL
ctl00$ContentPlaceHolder1$GenericWebUserControl$StartDate5  
ctl00$ContentPlaceHolder1$GenericWebUserControl$EndDate5    
ctl00$ContentPlaceHolder1$GenericWebUserControl$ShowAllCheckBox on

I'm currently getting HTTP 500 errors from the server. Do I need to include all of those GenericWebUserControl fields in the POST request as well? Also, do I need to include __EVENTVALIDATION?


EDIT 3: Here's the latest code. I'm still getting server 500 errors.

// Requires: using System; using System.IO; using System.Net; using System.Text;
// EVENTTARGET, VIEWSTATE and EVENTVALIDATION are assumed to be string fields
// defined elsewhere in the class, holding values that are already URL-encoded.
private void CreateRequest()
{
    // Form body: pairs are joined with '&', and placeholders {0}-{2} are
    // filled from the three fields above.
    string postData = String.Format(
        "__EVENTTARGET={0}" +
        "&__VIEWSTATE={1}" +
        "&__EVENTVALIDATION={2}" +
        "&ctl00$ContentPlaceHolder1$GenericWebUserControl$ShowAllCheckBox=on" +
        "&ctl00$ContentPlaceHolder1$GenericWebUserControl$Organization0=ALL" +
        "&ctl00$ContentPlaceHolder1$GenericWebUserControl$Initial+or+Amendment1=ALL" +
        "&ctl00$ContentPlaceHolder1$GenericWebUserControl$Relief+Requested2=ALL" +
        "&ctl00$ContentPlaceHolder1$GenericWebUserControl$Country3=ALL" +
        "&ctl00$ContentPlaceHolder1$GenericWebUserControl$Status4=ALL",
        EVENTTARGET, VIEWSTATE, EVENTVALIDATION);

    byte[] postBytes = Encoding.ASCII.GetBytes(postData);

    HttpWebRequest httpWebRequest = (HttpWebRequest)WebRequest.Create("http://services.cftc.gov/sirt/sirt.aspx?Topic=ForeignPart30Exemptions");
    httpWebRequest.Method = "POST";
    httpWebRequest.ContentType = "application/x-www-form-urlencoded";
    httpWebRequest.ContentLength = postBytes.Length; // length in bytes, not characters

    // Send the body; the using block closes the request stream.
    using (Stream requestStream = httpWebRequest.GetRequestStream())
    {
        requestStream.Write(postBytes, 0, postBytes.Length);
    }

    // Read the whole response, disposing the response and its stream afterwards.
    string outputHTML;
    using (HttpWebResponse httpWebResponse = (HttpWebResponse)httpWebRequest.GetResponse())
    using (Stream webResponseStream = httpWebResponse.GetResponseStream())
    using (StreamReader streamReader = new StreamReader(webResponseStream))
    {
        outputHTML = streamReader.ReadToEnd();
    }

    Console.WriteLine(outputHTML);
}

EDIT 4: I've determined that it's the postData string that's causing the server 500 error. If I make it an empty string, it outputs the entire webpage. Am I correct that everything from the Fiddler2 capture that had a value needs to go into the postData string? Also, __VIEWSTATE is an incredibly long string. Are there length limits or anything else I should be aware of?


EDIT 5: I ran all of the strings used in postData through a URL encoder, but I'm still getting server 500 errors. Is there any way for me to debug why that post body is invalid?


SOLUTION: OK, I couldn't get my postData string right, but when I pasted in the raw POST body, it worked. This looks like it will be good enough, but my concern is whether it will continue working.


Solution

  • That's an ASP.NET page. Clicking the check box causes the page to be posted back to the server. So rather than trying to simulate the JavaScript, what you want to do is simulate the POST request.

    This is notoriously tricky with ASP.NET pages, because you usually need to populate the hidden __VIEWSTATE input. I recommend using an HTTP debugging proxy like Fiddler to view the actual request as it's sent. You should be able to copy the ViewState from there.
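
A ViewState copied out of Fiddler is only a snapshot: __VIEWSTATE and __EVENTVALIDATION are generated by the server for each rendered page, so a hard-coded copy tends to go stale when the page changes. A more durable approach is to GET the page first, read the two hidden inputs out of the HTML, and send them back in the POST. Below is a minimal sketch of that idea, not a tested client for this particular site. The class and method names (PostbackSketch, ExtractHiddenField, BuildFormBody, FetchShowAll) are made up for illustration, the regex assumes the usual WebForms attribute order (name before value), the cookie container is a precaution that may not be needed, and the field list simply mirrors the Fiddler capture in the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public static class PostbackSketch
{
    // Returns the value of a hidden <input> such as __VIEWSTATE, or "" if absent.
    // Assumption: the name attribute appears before the value attribute.
    public static string ExtractHiddenField(string html, string name)
    {
        Match m = Regex.Match(
            html,
            "<input[^>]*name=\"" + Regex.Escape(name) + "\"[^>]*value=\"([^\"]*)\"",
            RegexOptions.IgnoreCase);
        return m.Success ? WebUtility.HtmlDecode(m.Groups[1].Value) : "";
    }

    // Builds an application/x-www-form-urlencoded body, encoding both names and
    // values so characters like '$', '+', '/' and '=' are not misread by the server.
    public static string BuildFormBody(IEnumerable<KeyValuePair<string, string>> fields)
    {
        var parts = new List<string>();
        foreach (KeyValuePair<string, string> field in fields)
        {
            parts.Add(WebUtility.UrlEncode(field.Key) + "=" + WebUtility.UrlEncode(field.Value));
        }
        return string.Join("&", parts);
    }

    // GETs the page, then replays the "Show All" postback with fresh hidden fields.
    public static string FetchShowAll(string url)
    {
        var cookies = new CookieContainer(); // assumption: harmless if the site sets no cookies

        var get = (HttpWebRequest)WebRequest.Create(url);
        get.CookieContainer = cookies;
        string page = ReadBody(get.GetResponse());

        const string prefix = "ctl00$ContentPlaceHolder1$GenericWebUserControl$";
        var fields = new Dictionary<string, string>
        {
            { "__EVENTTARGET", prefix + "ShowAllCheckBox" },
            { "__EVENTARGUMENT", "" },
            { "__LASTFOCUS", "" },
            { "__VIEWSTATE", ExtractHiddenField(page, "__VIEWSTATE") },
            { "__EVENTVALIDATION", ExtractHiddenField(page, "__EVENTVALIDATION") },
            { prefix + "Organization0", "ALL" },
            { prefix + "Initial or Amendment1", "ALL" },
            { prefix + "Relief Requested2", "ALL" },
            { prefix + "Country3", "ALL" },
            { prefix + "Status4", "ALL" },
            { prefix + "StartDate5", "" },
            { prefix + "EndDate5", "" },
            { prefix + "ShowAllCheckBox", "on" },
        };

        byte[] body = Encoding.ASCII.GetBytes(BuildFormBody(fields));
        var post = (HttpWebRequest)WebRequest.Create(url);
        post.Method = "POST";
        post.ContentType = "application/x-www-form-urlencoded";
        post.CookieContainer = cookies;
        post.ContentLength = body.Length;
        using (Stream stream = post.GetRequestStream())
        {
            stream.Write(body, 0, body.Length);
        }

        try
        {
            return ReadBody(post.GetResponse());
        }
        catch (WebException ex)
        {
            // A 500 response still has a body, which may describe what went wrong.
            if (ex.Response == null) throw;
            throw new InvalidOperationException("Server said: " + ReadBody(ex.Response), ex);
        }
    }

    // Reads a response to the end and disposes it.
    private static string ReadBody(WebResponse response)
    {
        using (response)
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Calling PostbackSketch.FetchShowAll("http://services.cftc.gov/sirt/sirt.aspx?Topic=ForeignPart30Exemptions") and writing the result to a file would fit the existing C parser. If the server still answers with a 500, the exception message carries the response body, which is often the quickest way to see why the post was rejected.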