I am currently working on a web scraper, and for the most part it works quite well. I have been using Beautiful Soup to extract HTML content; to extract JavaScript-rendered content, I just started using requests_html.
Unfortunately, I am running into some issues when extracting JavaScript data from the website "https://goglobal.com/", specifically from the section that includes "100+ countries", "2500+ employees", and "3 Billion dollars saved...". The code does not extract the values correctly. However, it seems to work fine for other websites that load dynamic content.
In an attempt to isolate the issue, I wrote the following script, but the values from the goglobal website are still displayed incorrectly.
from requests_html import HTMLSession
import time
session = HTMLSession()
url = "https://goglobal.com/"
r = session.get(url)
r.html.render(wait=10)
time.sleep(10)
print(r.html.html)
For reference, I searched the printed output for "counter-number".
My questions are as follows:
I attempted to identify and resolve the issue with the script above.
requests-html is pretty much deprecated; use requests & BeautifulSoup for static HTML, and selenium/playwright for harder-to-scrape/dynamic sites.
In this case requests + bs4 will suffice, because the numbers you are looking for are available in the static HTML. Here's how to get them:
import requests
from bs4 import BeautifulSoup
url = 'https://goglobal.com/'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
counters = {i.select_one('h3.title').text: i.select_one('span.counter-number').get('data-counter') for i in soup.select('div.counter-item')}
print(counters)
The reason it is not working with requests-html is probably that you are looking in the wrong place. The value you are looking at is animated, and the animation only starts when the element is visible/scrolled into view, but the actual number is in the data-counter attribute. The equivalent requests-html code still works without rendering:
from requests_html import HTMLSession
session = HTMLSession()
url = "https://goglobal.com/"
r = session.get(url)
counters = {i.find('h3.title', first=True).text: i.find('span.counter-number', first=True).attrs.get('data-counter') for i in r.html.find('div.counter-item')}
print(counters)
Again, requests-html is no longer being updated, and I much prefer requests & bs4, but both work in this situation.
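To illustrate why the element's text and its data-counter attribute can disagree, here is a minimal offline sketch that uses only the standard library. The markup in SAMPLE_HTML is hypothetical: it is modelled on the selectors used above, and the placeholder text "0" and the attribute values are assumptions, not copied from the real page. CounterParser is likewise just an illustrative helper.

```python
from html.parser import HTMLParser

# Hypothetical markup modelled on the selectors used above.
# The placeholder text "0" stands in for whatever the page shows
# before the counter animation runs; the real page may differ.
SAMPLE_HTML = """
<div class="counter-item">
  <span class="counter-number" data-counter="100">0</span>
  <h3 class="title">Countries</h3>
</div>
<div class="counter-item">
  <span class="counter-number" data-counter="2500">0</span>
  <h3 class="title">Employees</h3>
</div>
"""


class CounterParser(HTMLParser):
    """Collect the visible text and data-counter of each span.counter-number."""

    def __init__(self):
        super().__init__()
        self.in_counter = False
        self.results = []  # one dict per counter span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "counter-number" in classes:
            self.in_counter = True
            self.results.append(
                {"text": "", "data_counter": attrs.get("data-counter")}
            )

    def handle_data(self, data):
        # Only record text that sits inside a counter span.
        if self.in_counter:
            self.results[-1]["text"] += data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_counter = False


parser = CounterParser()
parser.feed(SAMPLE_HTML)
for item in parser.results:
    print(f"visible text: {item['text']!r}, data-counter: {item['data_counter']!r}")
```

If the real page behaves like this sample, searching the output for the text inside counter-number would show the placeholder rather than the final figure, while the attribute already holds the target value.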