java, web-scraping, jsoup

Scraping XML with JSoup


I'm trying to scrape an RSS feed located here.

At the moment I'm just trying to wrap my head around JSoup, so the following code is merely proof of concept (or an attempt at it, at least).

    public static void grabShakers(String url) throws IOException {
        // doc, desc, links and price are presumably fields declared outside this method
        doc = Jsoup.connect(url).get();

        desc = doc.select("title");
        links = doc.select("link");
        price = doc.select("span.price");
    }

It grabs the title of each item perfectly. The link output, however, is simply ten repeated closing link tags, and it never finds any prices. I thought the CDATA might be the issue, so I converted doc to HTML, stripped out the comments using .replace, and converted the result back to a Document for parsing, to no avail. Any insight would be greatly appreciated.

The following code is what I'm using to print out each element:

    for (Element src : price) {
        System.out.println(src);
    }

Solution

  • There are two problems with that feed:

    1. After parsing, the document contains only <link />..actual link.. instead of a full link tag, so the URL ends up in a text node after the empty tag
    2. The description (containing the price tag) is escaped HTML, which won't get parsed
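
The second problem can be seen in isolation. The sketch below is only an illustration: the sample string is hypothetical, and `unescapeBasic` is a made-up stand-in that handles just five entities, not a replacement for a full unescaper. It shows why a selector such as span.price finds nothing before unescaping: the markup is plain text until the entities are decoded.

```java
// Minimal stand-in for the unescape step, covering only five basic entities.
// For a real feed, use a complete unescaper such as
// StringEscapeUtils.unescapeHtml4() from Apache Commons Lang.
class EntityUnescapeSketch {

    // Hypothetical helper: decodes &lt; &gt; &quot; &#39; and &amp; only.
    static String unescapeBasic(String escaped) {
        return escaped
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&#39;", "'")
                .replace("&amp;", "&"); // last, so "&amp;lt;" becomes "&lt;" and not "<"
    }

    public static void main(String[] args) {
        // Hypothetical description content, as it might appear inside the feed
        String raw = "&lt;span class=&quot;price&quot;&gt;$9.99&lt;/span&gt;";

        System.out.println(raw.contains("<span")); // false: no real tag to select yet
        System.out.println(unescapeBasic(raw));    // <span class="price">$9.99</span>
    }
}
```

Once the text has been unescaped, it can be handed to an HTML parser and queried like any other markup.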

    Solution:

        // Requires jsoup and Apache Commons Lang 3 on the classpath
        import org.apache.commons.lang3.StringEscapeUtils;
        import org.jsoup.Jsoup;
        import org.jsoup.nodes.Document;
        import org.jsoup.nodes.Element;

        final String url = "http://www.amazon.com/gp/rss/movers-and-shakers/appliances/ref=zg_bsms_appliances_rsslink";
        Document doc = Jsoup.connect(url).get();

        for( Element item : doc.select("item") ) // Select all items
        {
            final String title = item.select("title").first().text(); // select the 'title' of the item
            final String link = item.select("link").first().nextSibling().toString().trim(); // select 'link' (see Note 1)

            // Unescape the description so its inner HTML can be parsed
            final Document descr = Jsoup.parse(StringEscapeUtils.unescapeHtml4(item.select("description").first().toString()));
            final String price = descr.select("span.price").first().text(); // select 'price' (see Note 2)

            // Output - Example
            System.out.println(title);
            System.out.println(link);
            System.out.println(price);
            System.out.println();
        }
    

    Note 1: Workaround for the link: select the (empty) link tag and take the text of the next node (a TextNode containing the actual link).

    Note 2: Workaround for the price: select the description tag, unescape the HTML, parse it, and select the price. For unescaping I used StringEscapeUtils.unescapeHtml4() from Apache Commons Lang.

    Output:
    (using link from above)

    #1: Epicurean Gourmet Series 20-Inch-by-15-Inch Cutting Board with Cascade Effect, Nutmeg with Natural Core
    http://www.amazon.com/Epicurean-Gourmet-20-Inch-15-Inch-Cutting/dp/B003MU9PLU/ref=pd_zg_rss_ms_la_appliances_1
    $72.95
    
    #2: GE 45600 Z-Wave Basic Handheld Remote
    http://www.amazon.com/GE-45600-Z-Wave-Handheld-Remote/dp/B0013V6RW0/ref=pd_zg_rss_ms_la_appliances_2
    $3.00
    
    #3: First Alert RD1 Radon Gas Test Kit
    http://www.amazon.com/First-Alert-RD1-Radon-Test/dp/B00002N83E/ref=pd_zg_rss_ms_la_appliances_3
    $10.60
    
    #4: Presto 04820 PopLite Hot Air Popper, White
    http://www.amazon.com/Presto-04820-PopLite-Popper-White/dp/B00006IUWA/ref=pd_zg_rss_ms_la_appliances_4
    $9.99
    
    #5: New 20 oz Espresso Coffee Milk Frothing Pitcher, Stainless Steel, 18/8 gauge
    http://www.amazon.com/Espresso-Coffee-Frothing-Pitcher-Stainless/dp/B000FNK3Z4/ref=pd_zg_rss_ms_la_appliances_5
    $8.19
    
    #6: PUR 18 Cup Dispenser with One Pitcher Filter DS-1800Z
    http://www.amazon.com/PUR-Dispenser-Pitcher-Filter-DS-1800Z/dp/B0006MQCA4/ref=pd_zg_rss_ms_la_appliances_6
    $22.17
    
    #7: Hamilton Beach 70610 500-Watt Food Processor, White
    http://www.amazon.com/Hamilton-Beach-70610-500-Watt-Processor/dp/B000SAOF5S/ref=pd_zg_rss_ms_la_appliances_7
    $21.95
    
    #8: West Bend 77203 Electric Can Opener, Metallic
    http://www.amazon.com/West-Bend-77203-Electric-Metallic/dp/B00030J1U2/ref=pd_zg_rss_ms_la_appliances_8
    $35.79
    
    #9: Custom Leathercraft 2077L Black Ski Glove, Large
    http://www.amazon.com/Custom-Leathercraft-2077L-Black-Glove/dp/B00499BS9A/ref=pd_zg_rss_ms_la_appliances_9
    $8.83
    
    #10: Cuisinart CPC-600 1000-Watt 6-Quart Electric Pressure Cooker, Brushed Stainless and Matte Black
    http://www.amazon.com/Cuisinart-CPC-600-1000-Watt-Electric-Stainless/dp/B000MPA044/ref=pd_zg_rss_ms_la_appliances_10
    $64.95
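
Both workarounds above deal with an XML feed that was read by an HTML parser. As a cross-check only, and not the answer's method, the sketch below reads the same three fields with the JDK's built-in XML parser, which keeps the text inside link and decodes the escaped description on its own. The feed content in `SAMPLE` and the names `RssItemSketch` and `readItems` are hypothetical. jsoup also has an XML mode (`Parser.xmlParser()`), which is not shown here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RssItemSketch {

    // Hypothetical one-item feed, shaped like the structure described above
    static final String SAMPLE =
            "<rss version=\"2.0\"><channel><title>Sample feed</title>"
          + "<item><title>#1: Sample item</title>"
          + "<link>http://example.com/item/1</link>"
          + "<description>&lt;span class=\"price\"&gt;$9.99&lt;/span&gt;</description>"
          + "</item></channel></rss>";

    // Returns one {title, link, description} array per <item> element.
    static List<String[]> readItems(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList items = doc.getElementsByTagName("item");
            List<String[]> result = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                result.add(new String[] {
                        childText(item, "title"),
                        childText(item, "link"),
                        childText(item, "description")
                });
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse feed", e);
        }
    }

    // Text of the first descendant with the given tag name, or "" if absent.
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        for (String[] item : readItems(SAMPLE)) {
            System.out.println(item[0]); // #1: Sample item
            System.out.println(item[1]); // http://example.com/item/1
            System.out.println(item[2]); // <span class="price">$9.99</span>
        }
    }
}
```

The description still comes back as an HTML string, so it would need to be passed to an HTML parser (for example `Jsoup.parse`) before selecting span.price, just as in Note 2.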