python html web-scraping beautifulsoup python-requests-html

Extracting a specific part of HTML


I am working on a web scraper using requests-html and Beautiful Soup (I am new to this). For one web page (https://www.selfridges.com/GB/en/cat/beauty/make-up/?pn=1) I am trying to scrape a single value, and I will then replicate the approach for other products. The HTML looks like:

<div class="plp-listing-load-status c-list-header__counter initialized" data-page-number="1" data-total-pages-count="57" data-products-count="60" data-total-products-count="3361" data-status-format="{available}/{total} results">60/3361 results</div>

I want to scrape the "57" from data-total-pages-count="57". I have tried:

soup = BeautifulSoup(page.content, "html.parser")
nopagesstr = soup.find(class_="plp-listing-load-status c-list-header__counter initialized").get('data-total-pages-count')

and

nopagesstr = r.html.find('[data-total-pages-count]',first=True)

But both return None. I am not sure how to select the 57 specifically. Any help would be appreciated.


Solution

  • To get the total page count, you can use this example, which sends a browser-like User-Agent header with the request:

    import requests
    from bs4 import BeautifulSoup
    
    
    url = "https://www.selfridges.com/GB/en/cat/beauty/make-up/?pn=1"
    
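    # Browser-like User-Agent sent with the request.
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0"
    }
    # Fetch the page with the custom headers and parse the returned HTML.
    soup = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")
    # Select the first element that carries the attribute and read its value.
    print(soup.select_one("[data-total-pages-count]")["data-total-pages-count"])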

    Prints (the value comes from the live page, so it may differ from the 57 shown in the question):

    56
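
For testing the selector without hitting the site, here is a minimal sketch that runs against the HTML snippet copied from the question. The helper names total_pages and page_urls are hypothetical, and the idea that page N is reached through ?pn=N is an assumption based only on the ?pn=1 in the question's URL. The sketch also guards against the None case and converts the attribute (which is a string) to an int.

```python
from bs4 import BeautifulSoup

# HTML copied from the question, so the example runs without a network request.
SAMPLE_HTML = (
    '<div class="plp-listing-load-status c-list-header__counter initialized" '
    'data-page-number="1" data-total-pages-count="57" data-products-count="60" '
    'data-total-products-count="3361" '
    'data-status-format="{available}/{total} results">60/3361 results</div>'
)


def total_pages(html):
    """Return data-total-pages-count as an int, or None if no element has it."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one("[data-total-pages-count]")
    if tag is None:
        # The element is missing, e.g. the server returned a different page.
        return None
    return int(tag["data-total-pages-count"])


def page_urls(base_url, pages):
    """Build one URL per listing page (assumption: pages are numbered via ?pn=N)."""
    return [f"{base_url}?pn={n}" for n in range(1, pages + 1)]


pages = total_pages(SAMPLE_HTML)
print(pages)
if pages is not None:
    urls = page_urls("https://www.selfridges.com/GB/en/cat/beauty/make-up/", pages)
    print(urls[0])
    print(urls[-1])
```

If total_pages returns None for the real response, printing the start of the response text is a quick way to check whether the server sent the expected listing page at all.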