
Scraping data into key-value pairs with jsoup


Hi, I need to scrape a website using jsoup and get the output as key-value pairs. Can anyone suggest how to do this?
The URL I need to scrape is https://www.cpsc.gov/Recalls?field_rc_date_value%5Bmin%5D&field_rc_date_value%5Bmax%5D&field_rc_heading_value=&field_rc_hazard_description_value=&field_rc_manufactured_in_value=&field_rc_manufacturers_value=&field_rc_number_value=

The code I have written is:

package com.jaysons;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ScrapeBody {
    public static void main(String[] args) throws IOException {
        String url = "https://www.cpsc.gov/Recalls?field_rc_date_value%5Bmin%5D&field_rc_date_value%5Bmax%5D&field_rc_heading_value=&field_rc_hazard_description_value=&field_rc_manufactured_in_value=&field_rc_manufacturers_value=&field_rc_number_value=";
        Document doc = Jsoup.connect(url).get();

        Elements content = doc.select("div.views-field views-field-php");
        doc = Jsoup.parse(content.html().replaceAll("</div>", "</div><span>")
                .replaceAll("<div", "</span><div"));
        Elements labels = doc.select("div.remedy");
        for (Element label : labels) {
            System.out.println(String.format("%s %s", label.text().trim(),
                    label.nextElementSibling().text()));
        }
    }
}

I need the output in key-value pairs, like:
Date:OCTOBER 20, 2017
remedy:
units:
website:http://www.bosch-home.com/us
phone:(888) 965-5813

Kindly let me know where I made a mistake.


Solution

  • There's no need to re-parse the HTML of the content variable and reassign doc; you can iterate over the selected elements directly:

    Elements content = doc.select("div.views-field >span");
    for (Element viewField : content) {
        /*
            each viewField corresponds to one
            <div class="views-field views-field-php"> 
              <span class="field-content">
                <a href="/Recalls/2018/BSH-Home-Appliances-amplía-retiro-del-mercado-de-lavavajillas">
                <div class="date">
                  October 20, 2017
                </div>
                ...
              </span>
            </div>
        */
        Elements divs = viewField.getElementsByTag("div");
        for (Element div : divs) {
          String className = div.className();
          if (className.equals("date")) {
            // extract and store the date, e.g. via div.text()
          } else if (className.equals("...")) {
            // do something else
          } // else...
        }
    }
    

    You can select sub-elements not only by tag, but also by class, ID, attribute, and so on. See the official documentation for details: https://jsoup.org/cookbook/extracting-data/dom-navigation

    Disclaimer: I could not test the code right now.
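
As a rough illustration of the approach above, here is a minimal sketch that collects each nested div into a key-value pair, using the div's class name as the key. It also shows why the original selector finds nothing: in jsoup's selector syntax a space is the descendant combinator, so "div.views-field views-field-php" looks for a <views-field-php> tag inside div.views-field, whereas an element carrying both classes is matched by joining them with dots. The sample markup is a hypothetical stand-in modelled on the structure quoted in the comment above; the class names "website" and "phone" and the RecallFields class itself are assumptions, not taken from the real page.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class RecallFields {

    // Hypothetical sample markup; only the "date" class and the outer
    // structure come from the answer above, the rest is assumed.
    static final String SAMPLE_HTML =
          "<div class=\"views-field views-field-php\">"
        + "<span class=\"field-content\">"
        + "<div class=\"date\"> October 20, 2017 </div>"
        + "<div class=\"website\">http://www.bosch-home.com/us</div>"
        + "<div class=\"phone\">(888) 965-5813</div>"
        + "</span></div>";

    // Number of elements in the parsed HTML that match a CSS selector.
    static int countMatches(String html, String selector) {
        return Jsoup.parse(html).select(selector).size();
    }

    // Returns one ordered map per recall entry: class name -> text.
    static List<Map<String, String>> extract(String html) {
        Document doc = Jsoup.parse(html);
        List<Map<String, String>> recalls = new ArrayList<>();
        // Both classes sit on the same element, so they are chained with dots.
        for (Element span : doc.select("div.views-field.views-field-php > span")) {
            Map<String, String> pairs = new LinkedHashMap<>();
            for (Element div : span.getElementsByTag("div")) {
                pairs.put(div.className(), div.text().trim());
            }
            recalls.add(pairs);
        }
        return recalls;
    }

    public static void main(String[] args) {
        // The selector from the question matches no elements in the sample.
        System.out.println(countMatches(SAMPLE_HTML, "div.views-field views-field-php"));
        for (Map<String, String> recall : extract(SAMPLE_HTML)) {
            for (Map.Entry<String, String> e : recall.entrySet()) {
                System.out.println(e.getKey() + ":" + e.getValue());
            }
        }
    }
}
```

To run this against the live page, Jsoup.parse(html) would be swapped for Jsoup.connect(url).get(), and the keys would need adjusting to whatever class names the page actually uses. A LinkedHashMap is used here so that the pairs print in document order.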