Tags: python, css, python-3.x, python-requests-html

Cannot find CSS class using Requests-HTML


After following this tutorial on finding a CSS class and copying its text from a website, I tried to implement the same thing in a small piece of test code, but sadly it didn't work. I followed the tutorial exactly on the same website and did get the headline of the webpage, but I can't get this process to work for any other class on that webpage, or on any other webpage. Am I missing something? I am a beginner programmer and have never used Requests-HTML or anything similar before. Here is an example of the code I'm using; the purpose is to grab the random fact that appears in the "af-description" class when you load the webpage.

from requests_html import HTMLSession
session = HTMLSession()
r = session.get('http://mentalfloss.com/amazingfactgenerator')
r.html.find('.af-description', first=True)
description = r.html.find('.af-description', first=True)
print("Fun Fact:" + description.text)

No matter how hard I try, how I rearrange things, or what different code I try, I can't get it to work. It seems unable to find the class or the text the class contains. Please help.


Solution

  • What you are trying to do requires that the HTML source contains an element with that class. A browser does much more than just download HTML; it also downloads the CSS and JavaScript referenced by the page and executes any scripts attached to it, which can trigger further network activity. If the content you are looking for was generated by JavaScript, you can see the elements in the browser's developer tools inspector, but that doesn't make them accessible to the r.html object, which at this point only holds the HTML as downloaded!

    In the case of the URL you tried to scrape, if you look at the network console you'll see that an AJAX GET request to http://mentalfloss.com/api/facts is made to fill the <div af-details> structures, so if you want to scrape that data you can just get it as JSON directly from the API:

    r = session.get('http://mentalfloss.com/api/facts')
    fact = r.json()[0]['fact']
    print("Fun Fact: " + fact)
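
The indexing above assumes the endpoint returns a non-empty JSON list of objects that each carry a 'fact' key, which is what r.json()[0]['fact'] implies. A slightly more defensive sketch follows; the helper name first_fact and the sample payload are hypothetical and only stand in for a real response:

```python
def first_fact(payload):
    """Return the 'fact' string of the first entry, or None if the shape differs.

    Assumes the payload is what r.json() returns for the facts endpoint:
    a list of dicts, each with a 'fact' key (inferred from the answer's indexing).
    """
    if not isinstance(payload, list) or not payload:
        return None
    first = payload[0]
    if not isinstance(first, dict):
        return None
    return first.get('fact')


# Hypothetical sample data standing in for r.json(); not real API output.
sample = [{'fact': 'Example fact text.'}]
fact = first_fact(sample)
if fact is not None:
    print("Fun Fact: " + fact)
```

With a live session you would pass r.json() instead of sample, and a None result would tell you the response no longer has the expected shape.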
    

    Alternatively, you can have requests_html render the page with JavaScript too, by calling r.html.render().

    This uses a headless browser to load the HTML, execute the JavaScript code embedded in it, perform the AJAX request and render the additional DOM elements, then serialize the whole page back to HTML for your code to mine. The first time you do this, the libraries required for the headless browser infrastructure are downloaded for you:

    >>> from requests_html import HTMLSession
    >>> session = HTMLSession()
    >>> r = session.get('http://mentalfloss.com/amazingfactgenerator')
    >>> r.html.render()
    [W:pyppeteer.chromium_downloader] start chromium download.
    Download may take a few minutes.
    # .... a lot more information elided
    [W:pyppeteer.chromium_downloader] chromium extracted to: /Users/mj/.pyppeteer/local-chromium/533271
    >>> r.html.render()
    >>> r.html.find('.af-description', first=True)
    <Element 'div' class=('af-description',)>
    >>> _.text
    'The cubicle did not get its name from its shape, but from the Latin “cubiculum” meaning bed chamber.'
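
The session above can be wrapped in a small helper that tries the static HTML first and only renders when the element is missing. This is a rough sketch, not part of requests_html: the name find_text is hypothetical, and it assumes the object passed in behaves like r.html, meaning find(selector, first=True) returns None when nothing matches and render() executes the page's scripts.

```python
def find_text(html, selector):
    """Return the text of the first match, rendering JavaScript only if needed.

    `html` is expected to behave like r.html: it needs find(selector, first=True)
    and render(). The helper name is hypothetical.
    """
    element = html.find(selector, first=True)
    if element is None:
        # Not in the static HTML: run the page's scripts, then look again.
        html.render()
        element = html.find(selector, first=True)
    # Still missing after rendering: report that with None instead of raising.
    return element.text if element is not None else None
```

It would be called as find_text(r.html, '.af-description'), so pages that already contain the element skip the render step and only script-generated content triggers rendering.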
    

    However, this requires your computer to do a lot more work; for this specific example, it's easier to just call the API directly.
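
To tell which situation you are in before choosing an approach, you can check whether the class occurs anywhere in the HTML as downloaded. The sketch below uses only the standard library; ClassFinder, has_class and both HTML snippets are made up for illustration and are not taken from the site in question.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Record whether any start tag carries the given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        # The class attribute may be absent or valueless; treat both as empty.
        classes = (dict(attrs).get('class') or '').split()
        if self.css_class in classes:
            self.found = True


def has_class(html_text, css_class):
    """Return True if css_class appears on any element in html_text."""
    finder = ClassFinder(css_class)
    finder.feed(html_text)
    return finder.found


# Made-up snippets: the first mimics a page whose content is filled in by a script.
static_page = '<div id="app"></div><script src="app.js"></script>'
rendered_page = '<div class="af-description">Some fact</div>'
print(has_class(static_page, 'af-description'))    # prints False
print(has_class(rendered_page, 'af-description'))  # prints True
```

With a live response you could pass r.text as html_text. A False result suggests the element is created by JavaScript, so you need either the underlying API or r.html.render().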