How to scrape product data from https://www.jumia.ma/pc-portables/


I have a problem getting the src of img and a tags using jsoup in a Spring project. Instead, I'm getting '1', '5', and 'data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7'.

Here is the code:

String url1 = "https://www.jumia.ma/pc-portables/";
Document doc1 = Jsoup.connect(url1).get();
 
// Lists of product names, prices, and image src values:
List<Element> noms1 = doc1.getElementsByClass("name");
List<Element> prix1 = doc1.getElementsByClass("prc");
List<Element> images1 = doc1.select("img");
 
for(int i = 0 ; i < 10 ; i++) {
    if (!noms1.get(i).ownText().isEmpty() && !prix1.get(i).ownText().isEmpty()) {
        Produit p = new Produit();
        p.setNom(noms1.get(i).ownText());
        p.setPrix(prix1.get(i).ownText());
        p.setImage(images1.get(i).attr("abs:src"));
        p.setUrl("https://www.jumia.ma/pc-portables/");
        p.setIdcategorie(5);
        produitRepository.save(p);
    }   
}

Solution

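  • These queries are too broad for your use case. They match every element on the page with that class (and every img), not only the ones inside the product list: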

    List<Element> noms1 = doc1.getElementsByClass("name");
    List<Element> prix1 = doc1.getElementsByClass("prc");
    List<Element> images1 = doc1.select("img");
    

    You can always print your query results and see for yourself. For example:

    List<Element> noms1 = doc1.getElementsByClass("name");
    System.out.println(noms1);
    

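    You should split your query into two steps: first select the container that holds the item list, then process each item inside it. Note that the code below reads the image URL from data-src rather than src, since the src you are getting is only a base64 placeholder GIF. Try this: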

    String url = "https://www.jumia.ma/pc-portables/";
    Document doc = Jsoup.connect(url).get();
    Element catalog = doc.selectFirst("[data-catalog]");
    for (Element item : catalog.select("article")) {
        // name
        System.out.println(item.select(".info > .name").text());
        // price
        System.out.println(item.select(".info > .prc").text());
        // image
        System.out.println(item.selectFirst(".img-c > img").attr("data-src"));
    }
    

    Note that this only parses the item data from the first page. If you want to scrape additional pages, repeat the process with a different URL. For example, here is the URL that loads the second page:

    https://www.jumia.ma/pc-portables/?page=2#catalog-listing
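
As a rough sketch of the pagination step, the page URLs can be generated from the pattern shown above. The class name CatalogPages, the helpers pageUrl and pageUrls, and the limit of 3 pages are hypothetical choices for illustration. The sketch also assumes that later pages follow the same ?page=N#catalog-listing pattern as page 2, which is not verified here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of jsoup: builds listing URLs page by page.
class CatalogPages {
    // Base listing URL taken from the question.
    static final String BASE_URL = "https://www.jumia.ma/pc-portables/";

    // Returns the URL for a 1-based page number. Page 1 is the bare base URL;
    // later pages are assumed to follow the "?page=N#catalog-listing" pattern.
    static String pageUrl(String baseUrl, int page) {
        if (page < 1) {
            throw new IllegalArgumentException("page must be >= 1");
        }
        return page == 1 ? baseUrl : baseUrl + "?page=" + page + "#catalog-listing";
    }

    // Returns the URLs for pages 1..lastPage (lastPage is a caller-chosen limit).
    static List<String> pageUrls(String baseUrl, int lastPage) {
        List<String> urls = new ArrayList<>();
        for (int page = 1; page <= lastPage; page++) {
            urls.add(pageUrl(baseUrl, page));
        }
        return urls;
    }

    public static void main(String[] args) {
        for (String url : pageUrls(BASE_URL, 3)) {
            // Each URL can be passed to Jsoup.connect(url).get() and parsed
            // with the same container/item loop shown in the answer.
            System.out.println(url);
        }
    }
}
```

Each generated URL can then be fetched and parsed with the same loop as for the first page. How many pages exist is not known up front, so one option is to stop when a page yields no article elements.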