magento html-parsing bigcommerce html-parser image-extraction

Extracting the main product image from an ecommerce product page


I am looking for options to extract the main image from a product page on a retailer's website. The problem is that a product page contains multiple images (related products and so on). One approach I thought might work is to extract all the image links, download each image, compare their sizes, and treat the one with the largest size in bytes as the main product image.
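To make the idea concrete, here is a minimal sketch of that brute-force approach using only the Python standard library. The names `ImgSrcCollector`, `fetch_bytes`, and `largest_image_url` are made up for illustration, and the downloader is passed in as a parameter so it can be replaced with a cached or fake one.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag on the page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def fetch_bytes(url):
    """Default downloader (hypothetical helper); swap in your own if needed."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def largest_image_url(page_html, page_url, fetch=fetch_bytes):
    """Return the absolute URL of the image with the most bytes, or None."""
    collector = ImgSrcCollector()
    collector.feed(page_html)
    best_url, best_size = None, -1
    for src in collector.sources:
        url = urljoin(page_url, src)  # resolve relative links
        try:
            size = len(fetch(url))
        except (OSError, ValueError):
            continue  # skip images that fail to download
        if size > best_size:
            best_url, best_size = url, size
    return best_url
```

This downloads every image on the page, which is exactly the cost I would like to avoid.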

Obviously that would be very inefficient. Most retailers use one of a handful of major ecommerce platforms, such as Magento or BigCommerce. Is it possible to detect the ecommerce platform and use the template each platform provides to extract the main product image precisely?

I know the approach will never be perfect, but I am looking for an algorithm that is right most of the time, about 80% or so. Is it doable?


Solution

  • Do you have a list of retailers that you're looking to extract the images from? If so, then go through each retailer's site manually, look at its HTML, and write code that extracts the image for that particular retailer. If not, then I'm afraid you're out of luck - you could just grab the biggest image on the page, or use some other heuristic, but there's no guarantee that you're grabbing the actual product image.

    The problem with creating some sort of generic utility is that each e-commerce platform has its own structure for displaying product images, and that structure can vary from site to site. For example, just because Magento usually structures its images in a certain way doesn't mean you'll always see them that way - it's entirely up to the theme that's currently applied.
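
As a rough sketch of the per-retailer approach, something like the following could work. Everything here is an assumption for illustration: the hostnames, the `id`/`class` values in `RETAILER_RULES`, and the function names are hypothetical, and you would fill in the rules by inspecting each retailer's HTML yourself. For unknown sites it falls back to the "biggest image" heuristic, using the `width` and `height` attributes declared in the markup rather than downloading anything.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical hand-written rules: hostname -> (attribute, value) that
# identifies the main product <img> on that retailer's pages.
RETAILER_RULES = {
    "shop.example": ("id", "main-product-image"),
    "store.example": ("class", "product-hero"),
}


class ImgCollector(HTMLParser):
    """Collects the attributes of every <img> tag as a dict."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images.append({k: v or "" for k, v in attrs})


def declared_area(img):
    """Width * height from the markup; 0 if missing or not numeric."""
    try:
        return int(img.get("width", 0)) * int(img.get("height", 0))
    except ValueError:
        return 0


def main_image_url(page_html, page_url, rules=RETAILER_RULES):
    """Use the retailer's rule if one exists, else guess the biggest image."""
    collector = ImgCollector()
    collector.feed(page_html)
    images = [img for img in collector.images if img.get("src")]
    rule = rules.get(urlparse(page_url).hostname)
    if rule:
        attr, value = rule
        for img in images:
            # split() lets a class rule match one class among several
            if value in img.get(attr, "").split():
                return urljoin(page_url, img["src"])
    if not images:
        return None
    # Fallback heuristic: no guarantee this is the product image.
    return urljoin(page_url, max(images, key=declared_area)["src"])
```

The rules table is the part that needs maintenance: when a retailer changes its theme, its entry has to be updated, and until then the function silently drops back to the heuristic.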