java web-scraping jsoup

Jsoup scraping image URL results in data:image/gif;base64,


I'm starting to learn Jsoup and want to scrape the Tesco web store. Here is the link:

https://www.tesco.com/groceries/en-GB/shop/fresh-food/all

I want to get the image of a product. When I inspect the page in Google Chrome, I see something like this:

<img src="https://img.tesco.com/Groceries/pi/321/5054775188321/IDShot_225x225.jpg"
     alt="Tesco British Unsalted Butter 250G" class="product-image"
     srcset="https://img.tesco.com/Groceries/pi/321/5054775188321/IDShot_90x90.jpg 768w, https://img.tesco.com/Groceries/pi/321/5054775188321/IDShot_225x225.jpg 4000w">

But my code:

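import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document doc = null;
try {
    doc = Jsoup.connect("https://www.tesco.com/groceries/en-GB/shop/home-and-ents/all?page=20").get();
} catch (IOException e) {
    e.printStackTrace();
}
// Only print when the fetch succeeded; otherwise doc is still null.
if (doc != null) {
    System.out.println(doc.getElementsByClass("product-image-wrapper").get(0));
}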

results in:

<a href="/groceries/en-GB/products/295626079" aria-hidden="true" class="product-image-wrapper" tabindex="-1">
 <div class="product-image__container">
  <img src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" alt="Sterling Blue Superkings 100 Pack" class="product-image">
 </div></a>

I think the problem is that the image URLs are loaded by JavaScript, which Jsoup does not execute. Is there any way to get the URL as I see it in Chrome, or should I use a more powerful tool such as HtmlUnit or Selenium?
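The src value in the output above is an inline data: URI, not a link to a product photo. As a minimal sketch using only the JDK, the payload can be decoded to see what it actually contains. The class and method names below (PlaceholderImage, isDataUri, decodePayload, gifSize) are hypothetical and chosen for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper; the names are illustrative and not part of Jsoup.
class PlaceholderImage {

    // True when the src value is an inline data: URI rather than a fetchable URL.
    static boolean isDataUri(String src) {
        return src != null && src.startsWith("data:");
    }

    // Decodes the base64 payload of a data: URI (everything after the first comma).
    static byte[] decodePayload(String dataUri) {
        String payload = dataUri.substring(dataUri.indexOf(',') + 1);
        return Base64.getDecoder().decode(payload);
    }

    // Reads the GIF logical screen size: a 6-byte signature, followed by
    // width and height as little-endian 16-bit values.
    static int[] gifSize(byte[] gif) {
        int width = (gif[6] & 0xFF) | ((gif[7] & 0xFF) << 8);
        int height = (gif[8] & 0xFF) | ((gif[9] & 0xFF) << 8);
        return new int[] {width, height};
    }

    public static void main(String[] args) {
        // The src value copied from the Jsoup output in the question.
        String src = "data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==";
        if (isDataUri(src)) {
            byte[] bytes = decodePayload(src);
            String signature = new String(bytes, 0, 6, StandardCharsets.US_ASCII);
            int[] size = gifSize(bytes);
            // Prints the GIF signature and pixel dimensions.
            System.out.println(signature + " " + size[0] + "x" + size[1]);
        }
    }
}
```

For the value in the question this prints "GIF89a 1x1": the image is a one-pixel GIF, which is consistent with a placeholder that a script later replaces with the real URL. A check like isDataUri can be used to detect when the fetched HTML does not yet contain the real image address.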


Solution

  • I ended up switching to Selenium. It may be slower, but at least I'm making progress. I also tried HtmlUnit, but it seems to handle JavaScript poorly.
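Once a browser-driven tool has rendered the page, the img element carries src and srcset values like the ones Chrome shows in the question. How those attribute strings are read from the tool is not shown here. The sketch below only covers the follow-up step of picking the largest candidate from a srcset string, using the JDK alone. SrcsetPicker and largest are hypothetical names, and the comma split assumes the URLs themselves contain no commas, which holds for the Tesco URLs above but not for every site.

```java
import java.util.Optional;

// Hypothetical helper; not part of Jsoup, Selenium, or HtmlUnit.
class SrcsetPicker {

    // Returns the URL with the largest width descriptor (for example "4000w").
    // Assumption: candidates are separated by commas and URLs contain no commas.
    static Optional<String> largest(String srcset) {
        if (srcset == null) {
            return Optional.empty();
        }
        String best = null;
        int bestWidth = -1;
        for (String candidate : srcset.split(",")) {
            String[] parts = candidate.trim().split("\\s+");
            // Skip entries without a width descriptor such as "768w".
            if (parts.length < 2 || !parts[1].endsWith("w")) {
                continue;
            }
            try {
                int width = Integer.parseInt(parts[1].substring(0, parts[1].length() - 1));
                if (width > bestWidth) {
                    bestWidth = width;
                    best = parts[0];
                }
            } catch (NumberFormatException e) {
                // Ignore a malformed descriptor and keep scanning.
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        // The srcset value from the Chrome markup in the question.
        String srcset = "https://img.tesco.com/Groceries/pi/321/5054775188321/IDShot_90x90.jpg 768w,"
                + "https://img.tesco.com/Groceries/pi/321/5054775188321/IDShot_225x225.jpg 4000w";
        System.out.println(largest(srcset).orElse("no candidate found"));
    }
}
```

With the srcset from the question, this selects the IDShot_225x225.jpg URL, the entry marked 4000w.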