Scrape a link from a button with Python


I am trying to scrape the link behind the "previous" button on this website. (The end goal is to enrich data for a RAG chatbot.)

https://onlinehelp.prinect-lounge.com/Prinect_Color_Toolbox/Version2021/de_10/#t=Prinect%2Fmeasuring%2Fmeasuring-4.htm

The prev/next buttons are in the top right corner. The link that has to be extracted on the given example subpage would be this one:

href="https://onlinehelp.prinect-lounge.com/Prinect_Color_Toolbox/Version2021/de_10/Prinect/measuring/measuring-3.htm"

I tried the standard approach with BeautifulSoup:

from bs4 import BeautifulSoup
import requests

url = "https://onlinehelp.prinect-lounge.com/Prinect_Color_Toolbox/Version2021/de_10/#t=Prinect%2Fmeasuring%2Fmeasuring-4.htm"
page = requests.get(url)

soup = BeautifulSoup(page.content, "html.parser")

# get full html section
test1 = soup.find(id="browseSeqBack")
print(test1)

# get full html section test 2
test2 = soup.find("div", class_="brs_previous").children
print(test2)

# get link directly test 3
secBackButton = soup.find(id="browseSeqBack")
href = secBackButton.attrs.get('href', None)
print(href)

However, neither test 1 nor test 2 returns the whole HTML section, and the direct query for the link does not work either. Test 1 returns this section:

<a class="wBSBackButton" data-attr="href:.l.brsBack" data-css="visibility: @.l.brsBack?'visible':'hidden'" data-rhwidget="Basic" id="browseSeqBack">
                                                 <span aria-hidden="true" class="rh-hide" data-html="@KEY_LNG.Prev"></span>

Thanks in advance :)


Solution

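  • The actual content is inside an iframe whose URL differs slightly: it uses the path Prinect/measuring/measuring-4.htm instead of the fragment #t=Prinect%2Fmeasuring%2Fmeasuring-4.htm. The part of a URL after # is never sent to the server, and requests does not run the page's JavaScript, so the outer page you downloaded only contains the empty button template without an href.

    You can get the content, plus the next and previous paths, like this: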

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin
    
    base_url = 'https://onlinehelp.prinect-lounge.com/Prinect_Color_Toolbox/Version2021/de_10/'
    # Path of the topic page loaded in the iframe (the decoded value of "#t=")
    path = 'Prinect/measuring/measuring-4.htm'
    
    response = requests.get(urljoin(base_url, path))
    response.raise_for_status()  # stop early on an HTTP error
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # The previous/next targets are stored as <meta> tags in the page head
    prev_path = soup.head.select_one('meta[name=brsprev]').get('value')
    next_path = soup.head.select_one('meta[name=brsnext]').get('value')
    
    print(f'previous: {urljoin(base_url, prev_path)}')
    print(f'next: {urljoin(base_url, next_path)}')
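
If you only have the "#t=" style URLs to start from, the topic URL can probably be derived instead of typed by hand. The following is a minimal sketch using only the standard library; the helper name topic_url is made up for illustration, and it assumes the fragment always has the form t=<percent-encoded path>, as in the example URL from the question.

```python
from urllib.parse import parse_qs, urldefrag, urljoin


def topic_url(shell_url):
    """Map a '#t=...' shell URL to the URL of the topic page it loads.

    Hypothetical helper: assumes the fragment looks like 't=<encoded path>'.
    Returns None when the URL has no 't' entry in its fragment.
    """
    # Split off everything after '#'; the server never sees that part
    base, fragment = urldefrag(shell_url)
    # parse_qs also decodes percent escapes such as %2F -> /
    topic = parse_qs(fragment).get("t", [None])[0]
    return urljoin(base, topic) if topic else None


shell = (
    "https://onlinehelp.prinect-lounge.com/Prinect_Color_Toolbox/"
    "Version2021/de_10/#t=Prinect%2Fmeasuring%2Fmeasuring-4.htm"
)
print(topic_url(shell))
```

The result can then be passed to requests.get in place of urljoin(base_url, path). Note that the meta tags may be missing on the first or last page of the sequence, in which case select_one returns None and the .get('value') call would need a guard.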