Problem Description:
Each product on this website https://www.asos.com/us/women/dresses/cat/?cid=8799 has several images. For instance, this is one product URL of a black dress https://www.asos.com/us/asos-design/asos-design-super-soft-volume-sleeve-turtle-neck-mini-sweater-dress-in-black/prd/204910824#colourWayId-204910828, and if you click on it, you can see that there are 4 images for this black dress. Also, there are 2 other color versions of this dress (camel and pink). For each of these colors, there are another 3-4 images. I would like to collect all of these images (each image of the black, camel, and pink version of this product).
What I tried (code below): So far, I have managed to collect all product URLs from the main page (one product URL being, for example, the second link provided above). But once I open a product URL, I cannot figure out how to access all the images on that page. I'd appreciate any guidance on this next step.
Code from Google Colab:
# Upload google drive files
from google.colab import drive
drive.mount('/content/drive')
# Import libraries
import urllib
import urllib.request
from bs4 import BeautifulSoup
import re
import requests
import matplotlib.pyplot as plt
from io import BytesIO
# Make Soup function
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
headers={'User-Agent':user_agent,}
def make_soup(url):
    request = urllib.request.Request(url, None, headers)
    thepage = urllib.request.urlopen(request)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata
# Find total page #s
site = 'https://www.asos.com/us/women/dresses/cat/?cid=8799'
soup = make_soup(site)
element = soup.find('p', class_='label_Ph1fi')
element = element.text
numbers = re.findall(r'\d{1,3}(?:,\d{3})*', element)
if len(numbers) >= 2:
    offset = int(numbers[0].replace(',', ''))
    num_images = int(numbers[1].replace(',', ''))
    # Round up so that a final, partially filled page is still counted
    num_pages = (num_images + offset - 1) // offset
    print(f"Images Per Page: {offset}")
    print(f"Total Images: {num_images}")
    print(f"Total Pages: {num_pages}")
else:
    print("Numbers not found")
#num_images = int(element.replace(',', '').split(' ')[0])
# Get all product urls
product_urls = []
# Page numbers in the URL are assumed to start at 1, not 0
for i in range(1, num_pages + 1):
    site = 'https://www.asos.com/us/women/dresses/cat/?cid=8799&page='
    site = site + str(i)
    soup = make_soup(site)
    for link in soup.find_all('a', class_='productLink_E9Lfb', href=True):
        href = link.get('href')
        if href:
            product_urls.append(href)
    print('Page ', i, ' done')
print(product_urls)
# Get all images per product url
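You can try reading the product data that the page embeds in window.asos.pdp.config.product, then requesting each product listed under its facetGroup (here, the other colors of the same dress):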
import json
import re
import requests
headers = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/119.0"
}
def get_data(html_source):
    data = re.search(r"window\.asos\.pdp\.config\.product = (.*);", html_source)
    data = json.loads(data.group(1))
    return data
def get_images(url):
    data = get_data(requests.get(url, headers=headers).text)
    for i in data["images"]:
        print(f'{i["colour"]:<15} {i["url"]}')
base_url = "https://www.asos.com/us/asos-design/asos-design-super-soft-volume-sleeve-turtle-neck-mini-sweater-dress-in-black/prd/204910824"
data = get_data(requests.get(base_url, headers=headers).text)
u = "https://www.asos.com/us/prd/"
for p in data["facetGroup"]["facets"][0]["products"]:
    get_images(u + str(p["productId"]))
Prints:
PINK            https://images.asos-media.com/products/asos-design-super-soft-volume-sleeve-turtle-neck-mini-sweater-dress-in-light-pink/204910765-1-pink
                https://images.asos-media.com/products/asos-design-super-soft-volume-sleeve-turtle-neck-mini-sweater-dress-in-light-pink/204910765-2
                https://images.asos-media.com/products/asos-design-super-soft-volume-sleeve-turtle-neck-mini-sweater-dress-in-light-pink/204910765-3
                https://images.asos-media.com/products/asos-design-super-soft-volume-sleeve-turtle-neck-mini-sweater-dress-in-light-pink/204910765-4
CAMEL           https://images.asos-media.com/products/asos-design-super-soft-volume-sleeve-turtle-neck-mini-sweater-dress-in-camel/204910786-1-camel
                https://images.asos-media.com/products/asos-design-super-soft-volume-sleeve-turtle-neck-mini-sweater-dress-in-camel/204910786-2
                https://images.asos-media.com/products/asos-design-super-soft-volume-sleeve-turtle-neck-mini-sweater-dress-in-camel/204910786-3
                https://images.asos-media.com/products/asos-design-super-soft-volume-sleeve-turtle-neck-mini-sweater-dress-in-camel/204910786-4
BLACK           https://images.asos-media.com/products/asos-design-super-soft-volume-sleeve-turtle-neck-mini-sweater-dress-in-black/204910824-1-black
                https://images.asos-media.com/products/asos-design-super-soft-volume-sleeve-turtle-neck-mini-sweater-dress-in-black/204910824-2
                https://images.asos-media.com/products/asos-design-super-soft-volume-sleeve-turtle-neck-mini-sweater-dress-in-black/204910824-3
                https://images.asos-media.com/products/asos-design-super-soft-volume-sleeve-turtle-neck-mini-sweater-dress-in-black/204910824-4
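If you would rather collect the URLs than print them, one option is to group them by color. In the output above, only the first image of each product appears to carry a color name and the remaining entries have an empty one, so this minimal sketch carries the last seen color forward. The helper name group_images_by_colour, the "UNKNOWN" fallback label, and the shortened example.com URLs are illustrative assumptions, not part of the ASOS data.

```python
from collections import defaultdict


def group_images_by_colour(images):
    """Group image URLs by colour, carrying the last non-empty colour forward.

    `images` is assumed to be a list of dicts with "colour" and "url" keys,
    shaped like the entries of data["images"] printed above.
    """
    grouped = defaultdict(list)
    current = "UNKNOWN"  # hypothetical label used until a colour name is seen
    for image in images:
        colour = (image.get("colour") or "").strip()
        if colour:
            current = colour
        grouped[current].append(image["url"])
    return dict(grouped)


# Sample data with shortened, made-up URLs (not real ASOS links).
sample_images = [
    {"colour": "BLACK", "url": "https://example.com/204910824-1-black"},
    {"colour": "", "url": "https://example.com/204910824-2"},
    {"colour": "", "url": "https://example.com/204910824-3"},
]

print(group_images_by_colour(sample_images))
```

Inside get_images, the print loop could then be swapped for a call such as group_images_by_colour(data["images"]), and the returned dictionaries merged across the products of the facet group.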