[python] [parsing] [beautifulsoup] [flickr]

BeautifulSoup returns incomplete HTML


I am reading a book about Python right now. It has a small homework project: "Write a program that goes to a photo-sharing site like Flickr or Imgur, searches for a category of photos, and then downloads all the resulting images." It is suggested to use only the webbrowser, requests, and bs4 libraries.

I cannot do it for Flickr. I found that the parser cannot get inside the element <div class="interaction-view">. Using "Inspect element" in Chrome I can see that there are a few div elements and an a element inside it. However, when I use the bs4 library, it cannot see them.

My code looks like this:

#!/usr/bin/env python3
# To download photos from Flickr

import requests, bs4

search_name = "spam"
website_name = requests.get('https://www.flickr.com/search/?text='
                       + search_name)
website_name.raise_for_status()
parse_obj = bs4.BeautifulSoup(website_name.text, "html.parser")
elements = parse_obj.select('body #content main .main.search-photos-results \
                .view.photo-list-view.requiredToShowOnServer \
                .view.photo-list-photo-view.requiredToShowOnServer.awake \
                .interaction-view')
print(elements)

It only prints:

[<div class="interaction-view"></div>, <div class="interaction-view"></div>...]

There are no nested elements inside them, and I do not understand why. Thank you!


Solution

  • The issue is that the content of <div class="interaction-view"></div> on Flickr is only loaded via JavaScript. You can check this by viewing the page source: you will find <div class="interaction-view"></div> with nothing inside the div tag.

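    A quick way to confirm this from Python, as a rough sketch along the lines of the question's own code (the URL and the interaction-view class are taken from the question), is to fetch the search page with requests and look at what the static HTML actually contains:

    import requests, bs4

    resp = requests.get('https://www.flickr.com/search/?text=spam')
    resp.raise_for_status()
    soup = bs4.BeautifulSoup(resp.text, "html.parser")

    # In the raw, un-rendered HTML the interaction-view divs are present
    # but empty, which is exactly what the question's print(elements) shows.
    for div in soup.select('.interaction-view')[:5]:
        print(div)           # e.g. <div class="interaction-view"></div>
        print(div.contents)  # no child elements
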
    You need to execute the JavaScript somehow. Since BeautifulSoup does not offer this, one solution is to use Selenium. Run pip install selenium and install geckodriver for Firefox (on macOS: brew install geckodriver). Then change your code to use Selenium to load the page:

    #!/usr/bin/env python3
    # To download photos from Flickr, rendering the page with Selenium first

    import bs4
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    search_name = "spam"
    url = 'https://www.flickr.com/search/?text=%s' % search_name

    browser = webdriver.Firefox()
    browser.get(url)
    delay = 3  # seconds to wait before giving up
    WebDriverWait(browser, delay).until(
        EC.presence_of_element_located((By.ID, '...')))

    # The browser has now executed the JavaScript, so page_source
    # contains the fully rendered markup.
    soup = bs4.BeautifulSoup(browser.page_source, "html.parser")

    elements = soup.select('body #content main .main.search-photos-results \
                    .view.photo-list-view.requiredToShowOnServer \
                    .view.photo-list-photo-view.requiredToShowOnServer.awake \
                    .interaction-view')
    print(elements)
    

    The WebDriverWait part is needed so Selenium waits until a certain element has loaded before you parse the page. You need to change ... to an id that you know will be present once the results render. See this answer for how the same thing can be done with classes.
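
    As a minimal sketch of that class-based variant, re-using the browser object from the code above and assuming the photo-list-photo-view class from the question's selector is what appears once the results render (any class you trust to be present works the same way):

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    delay = 3
    # Wait for a class instead of an id; the locator is a (By, value) tuple,
    # so only the first element changes compared to the By.ID version above.
    WebDriverWait(browser, delay).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'photo-list-photo-view')))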