Unable to select tags in BeautifulSoup via CSS Selector


Hi folks! I am currently working with BeautifulSoup to scrape some data from a website, and I'm having trouble selecting elements using soup.select().

Here's a screenshot from my browser of the section of code I'm working with.

[Screenshot: the relevant section of markup as shown in the browser]

Here's the very simple code I'm using at the moment to scrape the data. The idea is to select all of the <a> elements inside the <div> with id=lst_hdr_bm:

import urllib.request
from bs4 import BeautifulSoup

# Grab website source, make soup?
html = urllib.request.urlopen('https://infinitediscs.com')
soup = BeautifulSoup(html, 'html.parser')

tags = soup.select('#lst_hdr_bm > ul > li > a')
print(tags)

When I run this query in my browser (testing the CSS selector via document.querySelectorAll), it returns 82 elements, which is what I expect. When I run it via BeautifulSoup in Python, nothing is returned.

What could be causing this problem? Is there possibly some default limit on how deep the default HTML parser will parse? I am confused.


Solution

  • Since this site is loaded and modified dynamically using JavaScript, the final HTML (in the browser) will differ from the raw HTML retrieved with a GET request. BeautifulSoup only sees the raw HTML, so the selector matches nothing there. This is why you should always double-check the HTML, either in the browser's network tab or in response.text.

    If we observe the network tab, we can find that the desired data is retrieved from this endpoint: https://infinitediscs.com/Home/LoadNavbarData using a POST request.

    With this code you can get the list of brands:

    import requests
    import json
    from urllib.parse import urljoin
    
    base_url = 'https://infinitediscs.com/'
    data_url = '/Home/LoadNavbarData'
    
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'}
    response = requests.post(urljoin(base_url, data_url), headers=headers)
    response.raise_for_status()
    
    obj = json.loads(response.text)
    
    # Brands: List[Dict] (82 items)
    brands = obj['header_data']['BrandMenu']
    
    # If you want, you can create a dict mapping each brand name to its full URL
    # brand_links['ABC'] => https://infinitediscs.com/brand/abc
    brand_links = {brand['ItemTitle']: urljoin(base_url, brand['ItemLink']) for brand in brands}
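To double-check the raw HTML before blaming the selector, a small diagnostic along these lines may help. It is only a sketch using the standard library: the class `AnchorCounter` and the function `count_anchors` are hypothetical names, the two sample strings are made up rather than taken from the real site, and the counter assumes reasonably well-formed markup. It counts every `<a>` nested anywhere inside the container, which is looser than the `> ul > li > a` selector in the question.

```python
from html.parser import HTMLParser


class AnchorCounter(HTMLParser):
    """Count <a> tags nested anywhere inside the element with a given id."""

    # Void elements never get a closing tag, so they must not change the depth.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "source", "track", "wbr"}

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found = False  # was the container present at all?
        self.depth = 0      # greater than 0 while inside the container
        self.anchors = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        if self.depth > 0:
            self.depth += 1
            if tag == "a":
                self.anchors += 1
        elif dict(attrs).get("id") == self.target_id:
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in self.VOID:
            self.depth -= 1


def count_anchors(html, target_id):
    """Return (container_found, number_of_links_inside_it)."""
    parser = AnchorCounter(target_id)
    parser.feed(html)
    parser.close()
    return parser.found, parser.anchors


# Made-up samples: what a server might send vs. what the browser ends up with.
raw_html = '<div id="lst_hdr_bm"></div>'
rendered_html = (
    '<div id="lst_hdr_bm"><ul>'
    '<li><a href="/brand/abc">ABC</a></li>'
    '<li><a href="/brand/xyz">XYZ</a></li>'
    '</ul></div><a href="/other">Other</a>'
)

print(count_anchors(raw_html, "lst_hdr_bm"))
print(count_anchors(rendered_html, "lst_hdr_bm"))
```

If the decoded body of the GET response gives a zero count (or the container is not found at all) while the browser shows the links, the content is being filled in by a script after page load, and the data has to come from whatever request that script makes.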