Tags: java, authentication, web, screen-scraping

Getting data from a webpage (screen scraping)


Can someone please point me to a good tutorial on screen scraping? My university has a webpage where it uploads all the data for each class. To reach the home page of their site there is an entry screen with a login button. When pressed, it brings up a floating dialog asking for a user name and password, and then goes straight to the homepage. I do not know where the authorisation request is made, and I would like to be able to get data from the site programmatically. The data I need is behind many more screens with logins, but if I can get past this first screen with my ID and password I will be happy enough. Preferably I would like this in Java, but any language will do.


Solution

  • This sounds like the login dialog is not part of the original page, but constructed on the fly by some JavaScript, possibly through Ajax calls.

    What you will need is some sort of headless browser that supports JavaScript and Ajax.

    Have a look at HtmlUnit (http://htmlunit.sourceforge.net/), from the introduction:

    HtmlUnit is a "GUI-Less browser for Java programs". It models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, etc... just like you do in your "normal" browser.

    It has fairly good JavaScript support (which is constantly improving) and is able to work even with quite complex AJAX libraries, simulating either Firefox or Internet Explorer depending on the configuration you want to use.
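
    For example, the browser to simulate can be chosen when the WebClient is constructed. A minimal sketch, assuming a reasonably recent HtmlUnit release (the available BrowserVersion constants differ between versions):

    import com.gargoylesoftware.htmlunit.BrowserVersion;
    import com.gargoylesoftware.htmlunit.WebClient;

    public class BrowserChoice {

        public static void main(String[] args) {
            // Emulate Firefox rather than the default browser. Older
            // HtmlUnit releases use versioned constants such as
            // BrowserVersion.FIREFOX_38 instead of BrowserVersion.FIREFOX.
            WebClient webClient = new WebClient(BrowserVersion.FIREFOX);
            System.out.println("Simulating: " + webClient.getBrowserVersion());
        }
    }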

    Edit: here is an example:

    I noticed that the page you want to scrape (http://qub.ac.uk/qol/) uses basic authentication, so it is not some kind of HTML input form that pops up, but a browser dialog. When you press the 'Login' button on the start page, the page https://qub.ac.uk/qol/ is loaded, and that page is secured this way.

    For a test, I will only show you how to get the headings from the unsecured http://qub.ac.uk/qol/ page using HtmlUnit, because I have no access to the protected parts, of course.

    It should be clear how this works in general. Consult the excellent documentation and other resources on the web for more details on how to use the HtmlUnit API.

    package test;

    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.nio.charset.StandardCharsets;

    import javax.xml.bind.DatatypeConverter;

    import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.DomElement;
    import com.gargoylesoftware.htmlunit.html.DomNodeList;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class Scraper {

        public static void main(String[] args)
                throws FailingHttpStatusCodeException, MalformedURLException,
                IOException {
            WebClient webClient = new WebClient();

            // Basic authentication expects "user:password", base64-encoded,
            // in the Authorization request header.
            String username = "user";
            String password = "pw";
            String authString = username + ":" + password;
            String authEncoded = DatatypeConverter.printBase64Binary(
                    authString.getBytes(StandardCharsets.UTF_8));

            webClient.addRequestHeader("Authorization", "Basic " + authEncoded);

            // Load the page; HtmlUnit runs its JavaScript just like a
            // normal browser would.
            HtmlPage page = webClient.getPage("http://qub.ac.uk/qol/");
            // System.out.println(page.asXml());

            // Print the text content of every <h3> element on the page.
            DomNodeList<DomElement> headings = page.getElementsByTagName("h3");
            for (DomElement e : headings) {
                System.out.println("Got heading: " + e.getTextContent());
            }
        }
    }
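
    As a side note: instead of setting the Authorization header by hand, you can register the credentials with HtmlUnit's credentials provider and let it answer the HTTP authentication challenge itself. A minimal sketch of that variant (untested against this particular site):

    import com.gargoylesoftware.htmlunit.DefaultCredentialsProvider;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class CredentialsScraper {

        public static void main(String[] args) throws Exception {
            WebClient webClient = new WebClient();

            // The default credentials provider responds to HTTP
            // authentication challenges (such as basic auth) automatically.
            DefaultCredentialsProvider provider =
                    (DefaultCredentialsProvider) webClient.getCredentialsProvider();
            provider.addCredentials("user", "pw");

            // Subsequent requests authenticate as needed.
            HtmlPage page = webClient.getPage("http://qub.ac.uk/qol/");
            System.out.println(page.getTitleText());
        }
    }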