Tags: python, web-scraping, python-requests-html

Trouble extracting JavaScript content while using requests-html


I am currently working on a web scraper, and for the most part it works quite well. I have been using Beautiful Soup to extract HTML content; to extract JavaScript-generated content, I just started using requests-html.

Unfortunately, I am running into issues when extracting JavaScript-generated data from "https://goglobal.com/", specifically the section that shows "100+ countries", "2500+ employees", and "3 Billion dollars saved...". The code does not extract these values correctly, although it works fine for other websites that load content dynamically.

In an attempt to isolate the issue, I wrote the following script, but the values from the goglobal website are still displayed incorrectly.

from requests_html import HTMLSession
import time
session = HTMLSession()
url = "https://goglobal.com/"
r = session.get(url)

r.html.render(wait=10)
time.sleep(10)
print(r.html.html)

For reference, I checked the printed output by searching it for "counter-number".

My questions are as follows:

  1. Why is this content not being loaded correctly?
  2. Is there a way to solve it while still using requests-html?
  3. Can I solve this using Selenium or Playwright/Scrapy?

I attempted to identify and resolve the issue with the script above.


Solution

  • requests-html is no longer being updated. Use requests & BeautifulSoup for static HTML, and Selenium/Playwright for dynamic or harder-to-scrape sites.

    In this case requests + bs4 will suffice, because the numbers you are looking for are available in the static HTML. Here's how to get them:

    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://goglobal.com/'
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
    }
    
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Map each counter's title to the final value stored in its data-counter attribute
    counters = {
        item.select_one('h3.title').text: item.select_one('span.counter-number').get('data-counter')
        for item in soup.select('div.counter-item')
    }
    print(counters)
    

    The reason it is not working with requests-html is probably that you are looking in the wrong place. The value you are reading is the element's text, which is animated, and the animation only starts once the element is visible (scrolled into view). The actual number is stored in the data-counter attribute, so the equivalent requests-html code works even without rendering:

    from requests_html import HTMLSession
    
    session = HTMLSession()
    url = "https://goglobal.com/"
    r = session.get(url)
    
    # Read the final value from the data-counter attribute, not from the animated text
    counters = {
        item.find('h3.title', first=True).text: item.find('span.counter-number', first=True).attrs.get('data-counter')
        for item in r.html.find('div.counter-item')
    }
    print(counters)
    

    Again, requests-html is no longer being updated, and I much prefer requests & bs4, but both work in this situation.
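
    To illustrate the difference between the animated text and the data-counter attribute, here is a minimal offline sketch. It assumes a hypothetical HTML fragment modeled on the selectors used above (the markup, titles, and numbers are made up for illustration and are not copied from the site), and it uses only the standard library's html.parser so it runs without network access or extra packages. The CounterParser class is an illustrative helper, not part of requests-html or bs4.

```python
from html.parser import HTMLParser

# Hypothetical fragment: the visible text starts at 0 until the animation runs,
# while the final value sits in the data-counter attribute.
SAMPLE_HTML = """
<div class="counter-item">
  <span class="counter-number" data-counter="100">0</span>
  <h3 class="title">Countries</h3>
</div>
<div class="counter-item">
  <span class="counter-number" data-counter="2500">0</span>
  <h3 class="title">Employees</h3>
</div>
"""


class CounterParser(HTMLParser):
    """Collect the visible text and data-counter attribute of each counter span."""

    def __init__(self):
        super().__init__()
        self.in_counter = False
        self.results = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "counter-number" in classes:
            self.in_counter = True
            self.results.append(
                {"text": "", "data_counter": attrs.get("data-counter")}
            )

    def handle_data(self, data):
        # Only record text that appears inside a counter span
        if self.in_counter:
            self.results[-1]["text"] += data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_counter = False


parser = CounterParser()
parser.feed(SAMPLE_HTML)
for item in parser.results:
    print(item)
```

    In this sketch the "text" field is what a search for "counter-number" in the raw output would show next to the span, while "data_counter" holds the number the animation eventually counts up to.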