
How to extract dynamically generated links from a website using Python?


I am writing code that needs to collect all the links/URLs from a specific website, but the links appear to be dynamically generated: they are populated by JavaScript or some other dynamic content-loading mechanism after the initial HTML is fetched.

Initially, I used the following code:

import requests
from bs4 import BeautifulSoup


url = 'https://www.example.com/'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')

for link in soup.find_all('a'):
    print(link.get('href'))

However, this approach didn't work because the links are not present in the initial HTML; they are generated later by JavaScript. How can I extract these dynamically generated links? A simple hint would be greatly appreciated.


Solution

  • You're right that the initial HTML doesn't contain the links, so you need an approach that lets the JavaScript run before scraping the page. I like Selenium WebDriver with ChromeDriver:

    from bs4 import BeautifulSoup
    from selenium import webdriver

    url = 'https://www.example.com/'

    driver = webdriver.Chrome()  # Open an automated browser
    driver.get(url)              # Navigate to the target page

    soup = BeautifulSoup(        # Parse the content _after_ any dynamic JavaScript has run
        driver.page_source,
        'html.parser'
    )

    driver.quit()                # Close the browser

    for link in soup.find_all('a'):
        print(link.get('href'))
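
    One follow-up worth knowing: many of the printed `href` values will be relative paths (e.g. `/about`), not full URLs. The standard library's `urllib.parse.urljoin` can resolve them against the base URL. A minimal stdlib-only sketch, using a hypothetical list of hrefs standing in for whatever `link.get('href')` returns:

    ```python
    from urllib.parse import urljoin

    base_url = 'https://www.example.com/'

    # Hypothetical href values as they might come out of link.get('href');
    # anchors without an href attribute yield None
    hrefs = ['/about', 'contact.html', 'https://other.example.org/page', None]

    # Resolve each relative link against the base URL, skipping missing hrefs;
    # already-absolute URLs pass through unchanged
    absolute = [urljoin(base_url, h) for h in hrefs if h is not None]
    print(absolute)
    ```

    This prints `['https://www.example.com/about', 'https://www.example.com/contact.html', 'https://other.example.org/page']`.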