I am writing code that needs to get all the links/URLs from a specific website, but it seems like the links are dynamically generated and populated by JavaScript (or some other dynamic content-loading mechanism) after the initial HTML is fetched.
Initially, I used the following code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.example.com/'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.find_all('a'):
    print(link.get('href'))
However, this approach didn't work because the links are not written directly in the HTML; instead, they are generated later by JavaScript. How can I extract these dynamically generated links? A simple hint would be greatly appreciated.
You're right that the initial HTML doesn't contain the links, so you need an approach that lets the JavaScript run before you scrape the page. I like Selenium WebDriver with ChromeDriver:
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://www.example.com/'

driver = webdriver.Chrome()  # Open an automated browser
driver.get(url)              # Navigate to the target page

soup = BeautifulSoup(        # Parse content _after_ any dynamic JavaScript has run
    driver.page_source,
    'html.parser'
)
driver.quit()                # Close the browser once we have the rendered HTML

urls = []
for link in soup.find_all('a'):
    urls.append(link.get('href'))
    print(link.get('href'))