Tags: java, html, screen-scraping, jsoup

Identifying only the press release links on a page


My task is to find the actual press release links on a given page, for example http://www.apple.com/pr/.

My tool has to find only the press release links at the above URL, excluding advertisement links, tab links, and any other links found on that site.

The program below prints every link present in the given web page.

How can I modify it to find only the press release links at a given URL? I also want the program to be generic, so that it identifies the press release links on any press release page it is given.

import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class linksfind {
    public static void main(String[] args) {
        try {
            URL url = new URL("http://www.apple.com/pr/");
            // Fetch and parse the page; the second argument is the timeout in milliseconds.
            Document document = Jsoup.parse(url, 1000);
            // Prints the href of every <a> element, not just the press releases.
            for (Element element : document.getElementsByTag("a")) {
                System.out.println(element.attr("href"));
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}

Solution

  • Look at the HTML source code. Open the page in a normal web browser, right-click, and choose View Source. You have to find a path in the HTML document tree that uniquely identifies those links.

    They are all housed in a <ul class="stories"> element inside a <div id="releases"> element. The appropriate CSS selector would then be "div#releases ul.stories a".

    Here's how it should look:

    public static void main(String... args) throws Exception {
        URL url = new URL("http://www.apple.com/pr/");
        Document document = Jsoup.parse(url, 3000);
        for (Element element : document.select("div#releases ul.stories a")) {
            System.out.println(element.attr("href"));
        }
    }
    

    As of now, this yields exactly what you want:

    /pr/library/2010/07/28safari.html
    /pr/library/2010/07/27imac.html
    /pr/library/2010/07/27macpro.html
    /pr/library/2010/07/27display.html
    /pr/library/2010/07/26iphone.html
    /pr/library/2010/07/23iphonestatement.html
    /pr/library/2010/07/20results.html
    /pr/library/2010/07/19ipad.html
    /pr/library/2010/07/19alert_results.html
    /pr/library/2010/07/02appleletter.html
    /pr/library/2010/06/28iphone.html
    /pr/library/2010/06/23iphonestatement.html
    /pr/library/2010/06/22ipad.html
    /pr/library/2010/06/16iphone.html
    /pr/library/2010/06/15applestoreapp.html
    /pr/library/2010/06/15macmini.html
    /pr/library/2010/06/07iphone.html
    /pr/library/2010/06/07iads.html
    /pr/library/2010/06/07safari.html
    

    To learn more about CSS selectors, read the Jsoup manual and the W3C CSS selectors specification.
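
The hrefs printed above are relative to the site root, so they usually need to be resolved against the page URL before they can be fetched. Below is a minimal sketch of that step using only java.net.URI. The class name LinkResolver and the method resolveAll are hypothetical names chosen for illustration and are not part of the original answer. With Jsoup itself, element.absUrl("href") should give the same result when the document was parsed with a base URI, as it is when parsing from a URL.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration: turns scraped hrefs into absolute URLs.
final class LinkResolver {

    // Resolves each href (relative or already absolute) against the URL of
    // the page it was found on. Assumes every href is a syntactically valid
    // URI; URI.resolve throws IllegalArgumentException otherwise.
    static List<String> resolveAll(String pageUrl, List<String> hrefs) {
        URI base = URI.create(pageUrl);
        List<String> absolute = new ArrayList<>();
        for (String href : hrefs) {
            absolute.add(base.resolve(href).toString());
        }
        return absolute;
    }

    public static void main(String[] args) {
        // Two of the hrefs from the output listed above.
        List<String> hrefs = Arrays.asList(
                "/pr/library/2010/07/28safari.html",
                "/pr/library/2010/07/27imac.html");
        for (String link : resolveAll("http://www.apple.com/pr/", hrefs)) {
            System.out.println(link);
        }
    }
}
```

Resolving against the page URL rather than prepending the host by hand also handles hrefs that are relative to the current directory or that are already absolute.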