Tags: python, selenium-webdriver, web-scraping, beautifulsoup

How to get JavaScript-rendered images with BeautifulSoup?


At my school we have interactive whiteboards, and we can export them to a website through a provided link. The only problem is that the links expire (which is stupid), so I want to make a simple Python script that gets the images and downloads them.

Here is the link to the website: https://air.ifpshare.com/documentPreview.html?s_id=8ec97e16-51c4-4a77-9f64-7d5dccd9bb41#/detail/561f0184-384c-4ca1-91a4-b2e687865408/record

When I open Chrome and inspect the website, I see that the images sit in a main div with nested divs and img elements, which encode each image in base64. That makes them easy to decode in Python.
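
As a rough sketch of that decoding step, assuming each img src is a data URI of the form data:image/<type>;base64,<payload> (the exact format used on the page is not shown here), the bytes can be recovered with the standard library. The helper name decode_data_uri and the sample payload are made up for illustration.

```python
import base64
import re

# Matches src values such as "data:image/png;base64,iVBORw0..."
DATA_URI = re.compile(
    r"^data:image/(?P<ext>[\w.+-]+);base64,(?P<payload>.+)$", re.DOTALL
)

def decode_data_uri(src):
    """Return (extension, raw_bytes) for a base64 image data URI."""
    match = DATA_URI.match(src.strip())
    if match is None:
        raise ValueError("not a base64 image data URI")
    return match.group("ext"), base64.b64decode(match.group("payload"))

# Made-up sample: just the 8-byte PNG signature, encoded as a data URI
sample = "data:image/png;base64," + base64.b64encode(b"\x89PNG\r\n\x1a\n").decode("ascii")
ext, data = decode_data_uri(sample)
print(ext, len(data))
```

The returned bytes can be written to a file opened in 'wb' mode, using the extension to pick the file name.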

This is the simple script I wrote to get the HTML:

import requests

page = requests.get("https://air.ifpshare.com/documentPreview.html?s_id=8ec97e16-51c4-4a77-9f64-7d5dccd9bb41#/detail/561f0184-384c-4ca1-91a4-b2e687865408/record")
print(page.text)

The only problem is that when I request the HTML, I don't get any of the content. The content seems to be generated by the JavaScript on the page.

The same thing happens when I use Selenium.

Here is what I get:

<!DOCTYPE html><html><head><meta charset=utf-8><meta name=viewport content="width=device-width,initial-scale=1,maximum-scale=1,user-scalable=no"><link rel=stylesheet href=//at.alicdn.com/t/font_833191_27456hr9ow5.css><title id=PageTitle></title><style>html,
    body {
      max-width: 480px;
      height: 100%;
      margin: auto;
      background-size: 100% 100%;
      background: #F8F8F8;
    }</style><link href=/static/css/documentPreview.01a0856b7f615fdfd7f4b853e047bcd0.css rel=stylesheet></head><body><div id=app></div><script type=text/javascript src=/static/js/manifest.a3f705024b2774dd271e.js></script><script type=text/javascript src=/static/js/vendor.03a8b2ef6819d9eaa4e7.js></script><script type=text/javascript src=/static/js/documentPreview.a9f6fe7b5c4d6f073050.js></script></body></html> 

Does anyone know a workaround?


Solution

  • Note: this answer contains different methods to reach your goal.

    The web app fetches the image download URLs from an API endpoint, so you can get the images with the requests library and a few lines of code (bs4 is not needed).

    Here is the API endpoint: https://air.ifpshare.com/api/pub/files/UUID

    So what is the file UUID in your target URL?

    • Your provided URL: https://air.ifpshare.com/documentPreview.html?s_id=8ec97e16-51c4-4a77-9f64-7d5dccd9bb41#/detail/561f0184-384c-4ca1-91a4-b2e687865408/record
    • File UUID: the UUID that follows /detail/ in the URL fragment, here 561f0184-384c-4ca1-91a4-b2e687865408.
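
    If you would rather not copy the UUID by hand, it can be pulled out of the link with the standard library. This is only a sketch: the helper name file_uuid_from_link is made up, and it assumes every shared link keeps the same #/detail/<uuid>/record layout as the one in the question.

```python
from urllib.parse import urlsplit

def file_uuid_from_link(link):
    """Return the segment that follows 'detail' in the URL fragment."""
    fragment = urlsplit(link).fragment  # e.g. "/detail/<uuid>/record"
    parts = [p for p in fragment.split("/") if p]
    if "detail" not in parts or parts.index("detail") + 1 >= len(parts):
        raise ValueError("no /detail/<uuid> segment in link")
    return parts[parts.index("detail") + 1]

link = (
    "https://air.ifpshare.com/documentPreview.html"
    "?s_id=8ec97e16-51c4-4a77-9f64-7d5dccd9bb41"
    "#/detail/561f0184-384c-4ca1-91a4-b2e687865408/record"
)
print(file_uuid_from_link(link))  # 561f0184-384c-4ca1-91a4-b2e687865408
```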

    Append this file UUID to the API endpoint and request it. Each entry in the items list of the JSON response has a downloadUrl value, which is the download URL for one image. Here is the code:

    import requests

    def fetch_images(file_uuid):
        """Download every image listed for the given file UUID."""
        url = f"https://air.ifpshare.com/api/pub/files/{file_uuid}"
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # stop early on an HTTP error
        items = response.json()['items']
        for n, item in enumerate(items):
            # downloadUrl has no scheme in the response, so add it manually
            image_url = f"https:{item['downloadUrl']}"
            image_data = requests.get(image_url, timeout=30).content
            # the .png extension is assumed; adjust it if the files differ
            with open(f"image-{n}.png", 'wb') as image_file:
                image_file.write(image_data)

    fetch_images('561f0184-384c-4ca1-91a4-b2e687865408')  # file UUID is here
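
    A slightly more defensive way to add the missing scheme is urljoin, which keeps a scheme when one is already present. This is a sketch under the assumption that downloadUrl is a scheme-relative value such as //host/path; the helper name and the cdn.example.com URLs below are placeholders, not real response data.

```python
from urllib.parse import urljoin

API_BASE = "https://air.ifpshare.com/api/pub/files/"

def absolute_download_url(download_url):
    """Resolve a scheme-relative or absolute URL against the API base."""
    return urljoin(API_BASE, download_url)

# Placeholder values for illustration only
print(absolute_download_url("//cdn.example.com/boards/page-1.png"))
print(absolute_download_url("https://cdn.example.com/boards/page-2.png"))
```

    The first call borrows https from the base URL; the second returns its input unchanged.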