Search code examples
pythonpandasexcelxlsximageurl

How can I scrape an embedded image url from xlsx spreadsheet


I have Excel spreadsheets that have an image per row, scraping the image using this example works. However what I want to do instead of scraping the image of the spreadsheet is I want to extract the url associated with that image. If I open up the Excel file I can click on the image and navigate to the given url. Is it impossible to extract this URL via Python?

I have looked through the documentation on openpyxl to see if there are any examples of scraping embedded urls in images, and I couldn't find anything.

Any help would be much appreciated. Thanks


Solution

  • As a good start, you can read/unzip the spreadsheet :

    import zipfile
    import pandas as pd
    
    with zipfile.ZipFile("file.xlsx", "r") as zf:
        xmls = [zf.read(fn) for fn in zf.infolist()
                if fn.filename.startswith("xl/drawings/_rels/")]
    
    urls = (
        pd.concat([pd.read_xml(data).assign(SheetNumber=i)
             for i, data in enumerate(xmls, start=1)]).sort_values(by=["SheetNumber", "Id"])
            .loc[lambda x: x["TargetMode"].eq("External"), ["SheetNumber", "Target"]]
            .reset_index(drop=True)
    )
    

    Output :

    print(urls)
    
       SheetNumber                           Target
    0            1       https://stackoverflow.com/
    1            1   https://gis.stackexchange.com/
    2            2  https://meta.stackexchange.com/
    3            2           https://askubuntu.com/
    

    To go further, we can use and the Styler to put the images next to their urls :

    import string
    import base64
    from collections import defaultdict
    from openpyxl import load_workbook   
    
    workbook = load_workbook("file.xlsx")
    
    images = defaultdict(list)
    for ws in workbook:
        #https://github.com/ultr4nerd/openpyxl-image-loader
        for image in ws._images:
            row = image.anchor._from.row + 1
            col = string.ascii_uppercase[image.anchor._from.col]
            images[ws.title].append({f'{col}{row}': image._data()})
    
    def tag_img(ser):
        return r'<div style="display: flex; justify-content: center;">'\
                '<img src="data:image/png;base64,{}" width="200" height="50"></div>' \
                 .format(base64.b64encode(ser).decode("utf-8"))
    
    imgs = pd.concat(
        [pd.DataFrame(v).stack().apply(tag_img)
            .reset_index(level=1, name="Image")
            .assign(SheetName=k).rename(columns={"level_1": "CellCoord"})
                     for k,v in images.items()], ignore_index=True
    )
    
    (
        imgs.join(urls)[["SheetNumber", "SheetName", "CellCoord", "Image", "Target"]].style
            .set_properties(**{"border":"1px solid",
                "text-align": "center", "background-color": "white"})
            .format(hyperlinks="html").pipe(display)
    )
    

    Output :

    enter image description here

    Spreadsheet used (file.xlsx):

    enter image description here