Tags: asp.net, https, web-scraping, screen-scraping, ajaxcontroltoolkit

ASP.NET form scraping not working


I'm trying to scrape some pages on a website that uses ASPX forms. The forms involve adding details of people by updating the server (one person at a time) and then proceeding to a results page that shows information regarding the specified people. There are 5 steps to the process:

  1. Hit the login page (the site is HTTPS) by sending a POST request with my credentials. The response will contain cookies that will be used to validate all subsequent requests.

  2. Hit the search criteria page by sending a GET request (no parameters). The only purpose of this is to discover the __VIEWSTATE and __EVENTVALIDATION tokens in the HTML response to be used in the next step.

  3. Update the server with a person. This involves hitting the same page as in step 2, but using a POST request with form parameters that correspond to the form controls on the page for adding person details and their values. The form parameters will include the __VIEWSTATE and __EVENTVALIDATION tokens gained from the previous step. The server response will include a new __VIEWSTATE and __EVENTVALIDATION. This step can be repeated using the new __VIEWSTATE and __EVENTVALIDATION, or can proceed to the next step.

  4. Signal to the server that all people have been added. This involves hitting the same page as the previous 2 steps by sending a POST request with form parameters that correspond to the form controls on the page for signalling that all people have been added. The server response will simply be 25|pageRedirect||/path/to/results.aspx|.

  5. Hit the search results page specified in the redirect response from the previous step by sending a GET request (no parameters - cookies are enough). The server response will be the HTML that I need to scrape.
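Steps 2 and 3 above can be sketched roughly as follows. This is a minimal Python sketch using only the standard library, not the code from the Java or Ruby attempts. The function names and the control name `ctl00$txtName` are hypothetical placeholders, since the real form's field names are not given.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collects name/value pairs of every <input type="hidden"> element."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if is_hidden and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def extract_hidden_fields(html):
    """Return a dict of hidden form fields (e.g. __VIEWSTATE) found in html."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


def build_post_body(hidden_fields, form_values):
    """Merge the hidden tokens with the visible control values and
    URL-encode the result as an application/x-www-form-urlencoded body."""
    body = dict(hidden_fields)
    body.update(form_values)
    return urlencode(body)


if __name__ == "__main__":
    # Hypothetical page fragment standing in for the step 2 response.
    page = (
        '<input type="hidden" name="__VIEWSTATE" value="abc123" />'
        '<input type="hidden" name="__EVENTVALIDATION" value="ev456" />'
    )
    tokens = extract_hidden_fields(page)
    print(build_post_body(tokens, {"ctl00$txtName": "Jane Doe"}))
```

Collecting every hidden input, rather than just the two named tokens, means any other hidden field the page relies on is carried along to the POST as well.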


If I follow the process manually in any browser, filling in the form controls and clicking the buttons (testing with just one person), I get to the results page and the results are fine. If I do this programmatically from an application running on my machine, the search results HTML is wrong: the page returns valid HTML, but it contains no results compared with the browser version, and some values are null where they should not be.

I've run this using a Java application with Apache HttpClient handling the requests. I've also tried it using a Ruby script with Mechanize handling the requests. I've set up a proxy server using Charles to intercept and examine all 5 HTTPS requests. Using Charles, I've scrutinized the raw requests (headers and body) and compared the requests made by a browser with the requests made by the application(s). They are all identical (except for the VIEWSTATE / EVENTVALIDATION values and session cookie values, which I would expect to differ).

A few additional points about the programmatic attempts:

  • The login step returns successful data, and the cookies are valid (otherwise the subsequent requests would all fail)
  • Updating the server with a person (step 3) returns successful responses, in that they are the same as would be returned from interaction using a browser. I can only assume this must mean the server is updating successfully with the person added.
  • A custom header, X-MicrosoftAjax: Delta=true, is added to the requests in step 3 (just as the browser does)
  • I don't own or have access to the server I'm scraping
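For reference, responses to requests sent with X-MicrosoftAjax: Delta=true are not full HTML pages but a pipe-delimited list of entries, of which the `pageRedirect` line in step 4 is one. Below is a minimal Python sketch of a parser for that format. It assumes the commonly described `length|type|id|content|` layout, where `length` counts the characters of `content`; the function names are made up for illustration.

```python
def parse_delta(text):
    """Split a partial-rendering response into (type, id, content) tuples.

    Assumed layout of each entry: length|type|id|content|
    The content is read by length, so it may itself contain "|".
    """
    entries = []
    pos = 0
    while pos < len(text):
        bar = text.index("|", pos)
        length = int(text[pos:bar])
        type_end = text.index("|", bar + 1)
        id_end = text.index("|", type_end + 1)
        content_start = id_end + 1
        content = text[content_start:content_start + length]
        entries.append((text[bar + 1:type_end], text[type_end + 1:id_end], content))
        pos = content_start + length + 1  # skip the "|" that closes the entry
    return entries


def hidden_fields_from_delta(entries):
    """Pick out the refreshed hidden fields (e.g. the new __VIEWSTATE)."""
    return {eid: content for etype, eid, content in entries if etype == "hiddenField"}


def redirect_target(entries):
    """Return the URL of a pageRedirect entry, or None if there is none."""
    for etype, _eid, content in entries:
        if etype == "pageRedirect":
            return content
    return None


if __name__ == "__main__":
    sample = "21|pageRedirect||/path/to/results.aspx|"
    print(redirect_target(parse_delta(sample)))
```

The sample uses a prefix of 21 so that it matches the placeholder path; the 25 quoted in step 4 presumably matched the real, longer path.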

Given that my application requests are identical to the browser requests that succeed, it baffles me that the server treats them differently. I can't help but feel that this is an ASP.NET forms issue that I'm overlooking. I'd appreciate any help.

Update:
I went over the raw requests again a bit more methodically, and it turns out I was missing something in the form parameters of the requests. Unfortunately, I don't think it will be of much use to anyone else, because it seems to be specific to this particular ASP.NET server's logic.

The POST request that notifies the server that all people have been added (step 4) requires two form parameters specifying the county and address of the last person that was added to the search. I was including these form parameters in my request, but the values were empty strings. I figured the browser request was just snagging these values because when the user hits the Continue button on the form, those controls would have the values of the last person added. I figured they wouldn't matter and forgot about them, but I was wrong.
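A mismatch like this, where a parameter is present in both requests but empty in one of them, is easy to miss by eye. Here is a minimal Python sketch of a field-by-field comparison of two raw form bodies copied out of the proxy. The function name and the control names in the example are hypothetical, not taken from the real site.

```python
from urllib.parse import parse_qsl


def diff_form_bodies(browser_body, script_body,
                     ignore=("__VIEWSTATE", "__EVENTVALIDATION")):
    """Compare two application/x-www-form-urlencoded bodies.

    Returns {field: (browser_value, script_value)} for every field whose
    value differs; a value of None means the field is absent on that side.
    Fields listed in `ignore` are expected to differ and are skipped.
    Note: if a field name repeats, only its last value is compared.
    """
    browser = dict(parse_qsl(browser_body, keep_blank_values=True))
    script = dict(parse_qsl(script_body, keep_blank_values=True))
    report = {}
    for name in sorted(set(browser) | set(script)):
        if name in ignore:
            continue
        b_val, s_val = browser.get(name), script.get(name)
        if b_val != s_val:
            report[name] = (b_val, s_val)
    return report


if __name__ == "__main__":
    # Hypothetical bodies for the step 4 request.
    from_browser = "ctl00$txtCounty=Kent&ctl00$txtAddress=1+High+St&__VIEWSTATE=aaa"
    from_script = "ctl00$txtCounty=&ctl00$txtAddress=&__VIEWSTATE=bbb"
    for field, (b_val, s_val) in diff_form_bodies(from_browser, from_script).items():
        print(f"{field}: browser={b_val!r} script={s_val!r}")
```

`keep_blank_values=True` matters here: without it, `parse_qsl` drops empty parameters, so an empty string on one side could not be told apart from a missing field.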

It's a peculiar issue that I should have caught the first time. I can't complain though; I am scraping a site, after all.


Solution

  • Review the Charles logs again. The search results and other content may be coming over via Ajax, and your Java/Ruby apps may not be making all of the requests that the browser makes. Look for any POST or GET requests in between the ones you are already duplicating. If the search results are populated via JavaScript, your client app may not be able to handle this.
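One way to look for such in-between requests is to compare the two recorded sessions mechanically. Here is a minimal Python sketch, assuming both the browser session and the script session have been exported from the proxy as HAR (HTTP Archive) files. The function names and file names are hypothetical.

```python
import json
from urllib.parse import urlsplit


def load_har(path):
    """Read a HAR file (plain JSON) into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def request_sequence(har):
    """List (method, path) for every recorded request, in order.

    Query strings are ignored so that cache-busting parameters do not
    make otherwise identical requests look different.
    """
    return [
        (entry["request"]["method"], urlsplit(entry["request"]["url"]).path)
        for entry in har["log"]["entries"]
    ]


def missing_requests(browser_har, script_har):
    """Requests the browser made that the script never made."""
    made_by_script = set(request_sequence(script_har))
    return [req for req in request_sequence(browser_har) if req not in made_by_script]


if __name__ == "__main__":
    # Hypothetical file names for the two exported sessions.
    browser = load_har("browser-session.har")
    script = load_har("script-session.har")
    for method, path in missing_requests(browser, script):
        print(method, path)
```

Anything this prints, other than static assets such as images and stylesheets, is a candidate for a request the script also needs to make.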