python-3.x, web-scraping, beautifulsoup, python-requests, imagedownload

How to download the highest-resolution image with Python when the URL has no distinguishing marker?


I am kind of new to Python and web scraping, so pardon me if this is a simple task, but I think I have reached a dead end.

I am trying to scrape the Getty Images site, for example the URL https://www.gettyimages.com/detail/1135439582.

I have already scraped much of the information I need (title, caption, even tags, with the help of some regex), but I can't seem to download the image at the highest resolution.

I can't use Selenium right now, so I am trying my best to work with BeautifulSoup and requests.

The site uses one img tag, and when you "click" on the image it shows a higher-resolution version. I can only find that information in the source tags:

[<source media="(max-width: 1100px)" srcset="https://media.gettyimages.com/id/1135439582/pt/foto/vintage-culture-performs-live-onstage-during-the-second-day-of-lollapalooza-brazil-music.webp?s=612x612&amp;w=gi&amp;k=20&amp;c=32tqshWj7kB2Hn55L_uiRjde7yg5Sm6wuk3HYnnGJfc=" type="image/webp"/>, 
<source media="(max-width: 1100px)" srcset="https://media.gettyimages.com/id/1135439582/pt/foto/vintage-culture-performs-live-onstage-during-the-second-day-of-lollapalooza-brazil-music.jpg?s=612x612&amp;w=gi&amp;k=20&amp;c=32tqshWj7kB2Hn55L_uiRjde7yg5Sm6wuk3HYnnGJfc=" type="image/jpeg"/>, 
<source media="(max-width: 1530px)" srcset="https://media.gettyimages.com/id/1135439582/pt/foto/vintage-culture-performs-live-onstage-during-the-second-day-of-lollapalooza-brazil-music.webp?s=1024x1024&amp;w=gi&amp;k=20&amp;c=NuBtfcWoTL0JhTsdWRd9UcCasir1L7ywlVHY2PZqcmM=" type="image/webp"/>, 
<source media="(max-width: 1530px)" srcset="https://media.gettyimages.com/id/1135439582/pt/foto/vintage-culture-performs-live-onstage-during-the-second-day-of-lollapalooza-brazil-music.jpg?s=1024x1024&amp;w=gi&amp;k=20&amp;c=NuBtfcWoTL0JhTsdWRd9UcCasir1L7ywlVHY2PZqcmM=" type="image/jpeg"/>, 
<source srcset="https://media.gettyimages.com/id/1135439582/pt/foto/vintage-culture-performs-live-onstage-during-the-second-day-of-lollapalooza-brazil-music.webp?s=2048x2048&amp;w=gi&amp;k=20&amp;c=XUK00llqC1n_I1ZD2pjiRUYcqsM5XUZCD_dl7z9ouAc=" type="image/webp"/>, 
<source srcset="https://media.gettyimages.com/id/1135439582/pt/foto/vintage-culture-performs-live-onstage-during-the-second-day-of-lollapalooza-brazil-music.jpg?s=2048x2048&amp;w=gi&amp;k=20&amp;c=XUK00llqC1n_I1ZD2pjiRUYcqsM5XUZCD_dl7z9ouAc=" type="image/jpeg"/>]

I get this information with this line:

img = soup.find_all('source')

The img tag holds the same link:

 <img alt="Lollapalooza Sao Paulo 2019 - Day 2" class="AssetCard-module__image___dams4" data-testid="image-card-image" src="https://media.gettyimages.com/id/1135439582/pt/foto/vintage-culture-performs-live-onstage-during-the-second-day-of-lollapalooza-brazil-music.jpg?s=1024x1024&amp;w=gi&amp;k=20&amp;c=NuBtfcWoTL0JhTsdWRd9UcCasir1L7ywlVHY2PZqcmM="/> 

I understand that the source tag with srcset holds the correct information, but the link to download the image is the same as the one used in the img tag.

If I try to download the image with that link, I get the lowest resolution.

How can I overcome this in a way that works for a list of 2k+ photos?

Since the link remains the same, how can I pass the dimension value to get the highest possible resolution?

Either jpg or webp is fine; right now it doesn't matter which one.

Thank you all in advance =)


Solution

  • Here is one way of getting the source for the largest hero image on that page:

    import requests
    from bs4 import BeautifulSoup as bs
    
    url = 'https://www.gettyimages.nl/detail/nieuwsfoto%27s/vintage-culture-performs-live-onstage-during-the-second-nieuwsfotos/1135439582'
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
    }
    
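    r = requests.get(url, headers=headers)
    r.raise_for_status()  # fail early on a blocked or missing page
    soup = bs(r.text, 'html.parser')
    # The hero <picture> has one <source> per breakpoint; the ones without a
    # `media` attribute are the fallback and carry the 2048x2048 image.
    picture = soup.select_one('picture[data-testid="hero-picture"]')
    large_img_url = picture.find('source', media=None).get('srcset')
    print(large_img_url)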
    

    Result:

    https://media.gettyimages.com/id/1135439582/nl/foto/vintage-culture-performs-live-onstage-during-the-second-day-of-lollapalooza-brazil-music.webp?s=2048x2048&w=gi&k=20&c=GWimj-WfQSKOeojX1albE8rZT5OX1M-usTQeqnSM0B4=
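
If relying on the missing `media` attribute feels fragile, a variation is to read the size from each URL's `s=WxH` parameter and keep the largest one. The following is only a sketch under assumptions: `HTML` is a shortened stand-in for the page's markup (paths and `c=` values are abbreviated placeholders), `size_from_url` and `largest_source` are hypothetical helper names, and every `srcset` is assumed to hold a single URL, as in the output shown in the question. Note that the `c=` value differs for each size in that output, so taking the full URL from the page looks safer than rewriting `s=` by hand.

```python
import re
from urllib.parse import parse_qs, urlsplit

from bs4 import BeautifulSoup

# Shortened stand-in for the real page markup (assumption: same structure,
# abbreviated paths and placeholder c= values).
HTML = """
<picture data-testid="hero-picture">
  <source media="(max-width: 1100px)" srcset="https://media.gettyimages.com/id/1135439582/photo.webp?s=612x612&amp;w=gi&amp;k=20&amp;c=AAA" type="image/webp"/>
  <source media="(max-width: 1100px)" srcset="https://media.gettyimages.com/id/1135439582/photo.jpg?s=612x612&amp;w=gi&amp;k=20&amp;c=AAA" type="image/jpeg"/>
  <source media="(max-width: 1530px)" srcset="https://media.gettyimages.com/id/1135439582/photo.webp?s=1024x1024&amp;w=gi&amp;k=20&amp;c=BBB" type="image/webp"/>
  <source media="(max-width: 1530px)" srcset="https://media.gettyimages.com/id/1135439582/photo.jpg?s=1024x1024&amp;w=gi&amp;k=20&amp;c=BBB" type="image/jpeg"/>
  <source srcset="https://media.gettyimages.com/id/1135439582/photo.webp?s=2048x2048&amp;w=gi&amp;k=20&amp;c=CCC" type="image/webp"/>
  <source srcset="https://media.gettyimages.com/id/1135439582/photo.jpg?s=2048x2048&amp;w=gi&amp;k=20&amp;c=CCC" type="image/jpeg"/>
  <img src="https://media.gettyimages.com/id/1135439582/photo.jpg?s=1024x1024&amp;w=gi&amp;k=20&amp;c=BBB"/>
</picture>
"""


def size_from_url(url):
    """Return (width, height) from the s=WxH query parameter, or (0, 0)."""
    values = parse_qs(urlsplit(url).query).get("s", [])
    match = re.fullmatch(r"(\d+)x(\d+)", values[0]) if values else None
    return (int(match.group(1)), int(match.group(2))) if match else (0, 0)


def largest_source(soup, image_type=None):
    """Return the srcset URL with the largest area, optionally for one MIME type."""
    best_url, best_area = None, -1
    for source in soup.find_all("source"):
        srcset = source.get("srcset")
        if not srcset:
            continue
        if image_type and source.get("type") != image_type:
            continue
        url = srcset.split()[0]  # drop a width/density descriptor, if any
        width, height = size_from_url(url)
        if width * height > best_area:  # on ties the first match is kept
            best_url, best_area = url, width * height
    return best_url


soup = BeautifulSoup(HTML, "html.parser")
print(largest_source(soup))                # largest of any type
print(largest_source(soup, "image/jpeg"))  # largest JPEG only
```

Because the choice is based on the size in the URL rather than on which attributes a tag happens to have, the same function should keep working if the page adds or reorders breakpoints, as long as the `s=` parameter is still present.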
    

    BeautifulSoup documentation is quite extensive.
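
For the "list with 2k+ photos" part of the question, the URL still has to be saved to disk. Here is a minimal sketch of that step, assuming you have already collected `(asset_id, img_url)` pairs with the lookup above. The helper names `filename_from_url`, `download_image` and `download_all`, the output folder, and the one-second pause are all illustrative choices, not anything the site prescribes.

```python
import os
import time
from urllib.parse import urlsplit

import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
}


def filename_from_url(img_url, asset_id):
    """Build a local name such as '1135439582.webp' from the image URL."""
    ext = os.path.splitext(urlsplit(img_url).path)[1] or ".jpg"
    return f"{asset_id}{ext}"


def download_image(session, img_url, dest_path, chunk_size=65536):
    """Stream one image to disk so large files are not held in memory."""
    with session.get(img_url, headers=HEADERS, stream=True, timeout=30) as r:
        r.raise_for_status()
        with open(dest_path, "wb") as fh:
            for chunk in r.iter_content(chunk_size=chunk_size):
                fh.write(chunk)


def download_all(pairs, out_dir, pause=1.0):
    """Download every (asset_id, img_url) pair, skipping files already saved."""
    os.makedirs(out_dir, exist_ok=True)
    with requests.Session() as session:  # reuse one connection pool
        for asset_id, img_url in pairs:
            dest = os.path.join(out_dir, filename_from_url(img_url, asset_id))
            if os.path.exists(dest):
                continue  # lets an interrupted run be resumed
            download_image(session, img_url, dest)
            time.sleep(pause)  # be gentle with the server
```

With the URL printed in the result above, `download_all([("1135439582", large_img_url)], "images")` would write `images/1135439582.webp`.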