I'm starting to learn Jsoup and want to scrape the Tesco web store. Here is a link:
https://www.tesco.com/groceries/en-GB/shop/fresh-food/all
I want to get the image of a product. When I inspect the page in Google Chrome, I see something like this:
<img src="https://img.tesco.com/Groceries/pi/321/5054775188321/IDShot_225x225.jpg"
     alt="Tesco British Unsalted Butter 250G" class="product-image"
     srcset="https://img.tesco.com/Groceries/pi/321/5054775188321/IDShot_90x90.jpg 768w,
             https://img.tesco.com/Groceries/pi/321/5054775188321/IDShot_225x225.jpg 4000w">
But my code:
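import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

try {
    // Fetch and parse the HTML exactly as the server sends it; Jsoup does not run JavaScript
    Document doc = Jsoup.connect("https://www.tesco.com/groceries/en-GB/shop/home-and-ents/all?page=20").get();
    // Print the first product image wrapper (kept inside the try so doc is never null here)
    System.out.println(doc.getElementsByClass("product-image-wrapper").get(0));
} catch (IOException e) {
    e.printStackTrace();
}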
results in:
<a href="/groceries/en-GB/products/295626079" aria-hidden="true" class="product-image-wrapper" tabindex="-1">
<div class="product-image__container">
<img src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" alt="Sterling Blue Superkings 100 Pack" class="product-image">
</div></a>
I think the problem is that the image URLs are filled in by JavaScript, which Jsoup does not execute. Is there any way to get the URL as I see it in Chrome, or should I use a more powerful tool such as HtmlUnit or Selenium?
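One way to check that suspicion is to decode the `data:` URI that Jsoup returned and see what image it actually contains. Below is a minimal sketch using only the Java standard library; `PlaceholderCheck` and `gifSize` are hypothetical names made up for this illustration, and the code assumes the payload is a base64-encoded GIF, as in the output above.

```java
import java.util.Base64;

// Hypothetical helper: reads the pixel size from a base64 GIF data: URI.
class PlaceholderCheck {

    // Returns {width, height} taken from the GIF logical screen descriptor.
    // Assumes the URI has the form "data:image/gif;base64,<payload>".
    static int[] gifSize(String dataUri) {
        // Everything after the first comma is the base64 payload.
        String payload = dataUri.substring(dataUri.indexOf(',') + 1);
        byte[] bytes = Base64.getDecoder().decode(payload);
        // Bytes 0-5 hold the "GIF89a"/"GIF87a" signature; width and height
        // follow as 16-bit little-endian integers.
        int width = (bytes[6] & 0xFF) | ((bytes[7] & 0xFF) << 8);
        int height = (bytes[8] & 0xFF) | ((bytes[9] & 0xFF) << 8);
        return new int[] {width, height};
    }

    public static void main(String[] args) {
        // The src value from the Jsoup output above.
        String src = "data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==";
        int[] size = gifSize(src);
        System.out.println(size[0] + "x" + size[1]); // prints 1x1
    }
}
```

A 1x1 GIF is consistent with a placeholder that a script later swaps for the real image, although this check alone does not show where the real URL comes from.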
In the end I switched to Selenium. It may be slower, but at least I'm making progress. I also tried HtmlUnit, but it seems to handle JavaScript poorly.
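Once the rendered markup is available, the `srcset` attribute seen in Chrome lists one URL per image size, so the largest image can be picked from it. Here is a minimal sketch using only the Java standard library. `SrcsetPicker` and `widestCandidate` are hypothetical names, and the parsing assumes every candidate looks like `<url> <width>w` and that the URLs contain no commas, which holds for the example above but not for every `srcset` in general.

```java
// Hypothetical helper: picks the widest candidate from an <img> srcset value.
class SrcsetPicker {

    // Assumes candidates of the form "<url> <width>w", separated by commas,
    // and URLs that contain no commas themselves.
    static String widestCandidate(String srcset) {
        String bestUrl = null;
        int bestWidth = -1;
        for (String candidate : srcset.split(",")) {
            String[] parts = candidate.trim().split("\\s+");
            if (parts.length != 2 || !parts[1].endsWith("w")) {
                continue; // skip candidates without a width descriptor
            }
            int width = Integer.parseInt(parts[1].substring(0, parts[1].length() - 1));
            if (width > bestWidth) {
                bestWidth = width;
                bestUrl = parts[0];
            }
        }
        return bestUrl; // null if no candidate had a width descriptor
    }

    public static void main(String[] args) {
        // The srcset value from the Chrome markup above.
        String srcset = "https://img.tesco.com/Groceries/pi/321/5054775188321/IDShot_90x90.jpg 768w,"
                + "https://img.tesco.com/Groceries/pi/321/5054775188321/IDShot_225x225.jpg 4000w";
        System.out.println(widestCandidate(srcset));
    }
}
```

For the example value this selects the `IDShot_225x225.jpg` URL, since `4000w` is the larger width descriptor.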