python web-scraping base64 decode

Issues with decoding base64 image data in Python's Beautiful Soup


I'm trying to scrape some data from a website using Python and Beautiful Soup, specifically an image in base64 format. However, when I run my code, the image data appears in a strange format like this:

"image": "data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7",

Here's the relevant code snippet:

import requests
from bs4 import BeautifulSoup

def search_mercadolivre_by_category(category):
    url = f"https://lista.mercadolivre.com.br/{category}"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    products = soup.find_all("li", {"class": "ui-search-layout__item"})
    results = []
    for product in products:
        title = product.find("h2", {"class": "ui-search-item__title"}).text.strip()
        price = product.find("span", {"class": "price-tag-fraction"}).text.strip()
        link = product.find("a", {"class": "ui-search-link"})['href']
        image = product.find("img")['src']
        results.append({
            "title": title,
            "price": price,
            "link": link,
            "image": image,
            "category": category,
            "website": "Mercado Livre",
            "keyword": ""
        })
    return results

Can anyone help me decode the image data properly?

I expected the src attribute to contain the real image URL, as in this element from the page:

<img width="160" height="160" decoding="async" src="https://http2.mlstatic.com/D_NQ_NP_609104-MLA50695427900_072022-V.webp" class="ui-search-result-image__element shops__image-element" alt="Samsung Galaxy M13 Dual SIM 128 GB verde 4 GB RAM">

Solution

  • That's a data URI: the image bytes are embedded directly in the src attribute, base64-encoded. The simplest way to read it is with urllib:

    from urllib import request
    
    with request.urlopen('data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7') as response:
        im = response.read()
    

    If you look at the first few bytes, you can see it is indeed a 1x1 GIF image:

    print(im[:10])       # prints b'GIF89a\x01\x00\x01\x00'
    

    If you want to save it to disk as image.gif, you can use:

    from pathlib import Path
    Path('image.gif').write_bytes(im)
    

    If you want to open it in PIL, you can wrap it in a BytesIO and open it like this:

    from PIL import Image
    from io import BytesIO
    
    # Open as PIL Image
    PILImage = Image.open(BytesIO(im))
    
    PILImage.show()               # display in viewer
    PILImage.save('result.png')   # save to disk as PNG
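
A 1x1 GIF like this is usually a lazy-loading placeholder, which would explain why the scraped src differs from the <img> tag seen in the browser. The sketch below decodes a data URI with the standard base64 module and falls back to another attribute when src is only a placeholder. It assumes the page keeps the real URL in a separate attribute; the name data-src is a guess and is not confirmed for this site, so inspect the raw HTML to find the right one. decode_data_uri and pick_image_url are hypothetical helper names.

```python
import base64
from urllib.parse import unquote_to_bytes


def decode_data_uri(uri):
    """Return (mime_type, raw_bytes) for a data: URI."""
    if not uri.startswith("data:"):
        raise ValueError("not a data URI")
    # Everything before the first comma is the header, the rest is the payload.
    header, _, payload = uri[len("data:"):].partition(",")
    mime_type = header.split(";")[0] or "text/plain"
    if header.endswith(";base64"):
        data = base64.b64decode(payload)
    else:
        # Non-base64 payloads are percent-encoded.
        data = unquote_to_bytes(payload)
    return mime_type, data


def pick_image_url(attrs, lazy_attr="data-src"):
    """Prefer the lazy-load attribute when src is an inline placeholder.

    attrs is a dict of tag attributes, e.g. product.find("img").attrs.
    lazy_attr is an assumption: check the page source for the real name.
    """
    src = attrs.get("src", "")
    if src.startswith("data:") and attrs.get(lazy_attr):
        return attrs[lazy_attr]
    return src
```

In the question's loop this would replace the image line with something like image = pick_image_url(product.find("img").attrs). If no such attribute exists in the HTML returned by requests, the real URL is probably set by JavaScript and will not be available without rendering the page.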