Tags: java, http, web-scraping, parameters, response

Scrape information from Web Pages with Java?


I'm trying to extract data from a webpage. For example, let's say I wish to fetch information from chess.org.il.

I know the player's ID is 25022, which means I can request http://www.chess.org.il/Players/Player.aspx?Id=25022

On that page I can see that this player's FIDE ID is 2821109.
From that, I can request this page:
http://ratings.fide.com/card.phtml?event=2821109

And from that I can see that stdRating=1602.

How can I get the "stdRating" output from a given "localID" input in Java?

(localID, fideID and stdRating are just helper names that I use to clarify the question.)


Solution

  • As @Alex R pointed out, you'll need a web scraping library for this.
    The one he recommended, jsoup, is quite robust and is commonly used for this task in Java, at least in my experience.

    You'd first need to fetch the page and parse it into a Document, e.g.:

    // needs org.jsoup.Jsoup and org.jsoup.nodes.Document; get() throws IOException
    int localID = 25022; // your player's ID
    Document doc = Jsoup.connect("http://www.chess.org.il/Players/Player.aspx?Id=" + localID).get();
    

    From this Document object you can extract a lot of information, for example the FIDE ID you asked about. Unfortunately, the page you linked isn't very simple to scrape, so you'll basically need to go through the links on the page to find the relevant one, for example:

    Elements fidelinks = doc.select("a[href*=fide.com]");
    

    This Elements object should give you a list of all links whose href contains the text fide.com, but you probably only want the first one, e.g.:

    Element fideurl = doc.selectFirst("a[href*=fide.com]"); // null if no link matches
    

    From that point on, I don't want to write all the code for you, but hopefully this answer serves as a good starting point!

    You can get the ID alone by calling the text() method on your Element object, but you can also get the link itself by calling attr("href") on it.
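
    If the link text turns out not to be the bare ID, the number can also be pulled out of the href string. Here is a minimal sketch that uses only the standard library. It assumes the href has the same shape as the FIDE URL in the question (card.phtml?event=<digits>); the class and method names are hypothetical, not part of jsoup:

    ```java
    import java.util.OptionalLong;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Hypothetical helper for illustration; not part of jsoup.
    final class FideLinks {
        // Assumption: the ID is the numeric value of the "event" query parameter,
        // as in http://ratings.fide.com/card.phtml?event=2821109
        private static final Pattern EVENT_PARAM =
                Pattern.compile("[?&]event=(\\d{1,18})(?:[&#]|$)");

        private FideLinks() {}

        /** Returns the FIDE ID found in the href, or empty if there is no numeric event parameter. */
        public static OptionalLong fideIdFromHref(String href) {
            Matcher m = EVENT_PARAM.matcher(href);
            return m.find()
                    ? OptionalLong.of(Long.parseLong(m.group(1)))
                    : OptionalLong.empty();
        }
    }
    ```

    With the Element from above, this would be called as FideLinks.fideIdFromHref(fideurl.attr("href")). Returning an OptionalLong rather than throwing makes the "link present but in an unexpected format" case explicit for the caller.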

    The CSS selector you can use to get the other value (the std rating) is div#main-col table.contentpaneopen tbody tr td table tbody tr td table tbody tr:nth-of-type(4) td table tbody tr td:first-of-type. It selects the std rating specifically, at least in standard CSS, so it should work with jsoup as well.
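
    Once you have the text of that cell, two small pieces of glue remain: building the second request URL from the FIDE ID, and turning the cell text into a number. The sketch below covers both with the standard library only. It assumes the rating is the first run of digits in the cell text, which may not hold for unrated players or if the page layout changes; the class and method names are hypothetical:

    ```java
    import java.util.OptionalInt;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Hypothetical helper for illustration; not part of jsoup.
    final class FideRatings {
        // URL shape taken from the question.
        private static final String CARD_URL = "http://ratings.fide.com/card.phtml?event=";
        // Assumption: the rating is the first run of digits in the cell text.
        private static final Pattern FIRST_NUMBER = Pattern.compile("\\d{1,9}");

        private FideRatings() {}

        /** Builds the FIDE card URL for the given FIDE ID. */
        public static String cardUrl(long fideId) {
            return CARD_URL + fideId;
        }

        /** Returns the first number found in the cell text, or empty if there is none. */
        public static OptionalInt parseRating(String cellText) {
            if (cellText == null) {
                return OptionalInt.empty();
            }
            Matcher m = FIRST_NUMBER.matcher(cellText);
            return m.find()
                    ? OptionalInt.of(Integer.parseInt(m.group()))
                    : OptionalInt.empty();
        }
    }
    ```

    The string passed to parseRating would be the text() of the element matched by the selector above, and cardUrl gives the address to pass to Jsoup.connect for the second request.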