html, web-scraping, data-extraction

How to scrape data from <ul> and <li> list tags


I can already extract data from a webpage when the elements I want carry a unique identifier, such as a class or an id (for example on a span). What should I do when the page does not provide a unique identifier for the elements I need?

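import requests
from bs4 import BeautifulSoup

url = "https://dblp.org/"
r = requests.get(url)
print(r.content)  # raw response bytes
b = BeautifulSoup(r.text, "html.parser")
print(b.prettify())
# The list I want has no id, so this lookup matches nothing
a = b.find_all("ul", {"id": "browsable"})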

This comes back empty, whereas the expected result is a list of the links available on the page.


Solution

  • You can use type selectors to target a tags inside li elements. Taking body as the ancestor, for example, the following collects the href of every a element nested in an li:

    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://dblp.org/'
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'lxml')
    links = [item['href'] for item in soup.select('body li a')]
    print(links)
    

    If the links must also sit inside a ul element, use this selector instead:

    body ul li a
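
To see how the two selectors differ without hitting the network, here is a minimal sketch run against a small inline document. The markup in SAMPLE_HTML is made up for illustration and is not taken from dblp.org. The `a[href]` attribute selector is an addition that skips anchors without an href, which would otherwise raise a KeyError on `item['href']`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a real page: one <ul>, one <ol>,
# and one anchor that has no href attribute.
SAMPLE_HTML = """
<html><body>
  <ul>
    <li><a href="/db/journals/">Journals</a></li>
    <li><a href="/db/conf/">Conferences</a></li>
    <li><a>No link target</a></li>
  </ul>
  <ol>
    <li><a href="/faq/">FAQ</a></li>
  </ol>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Any <a> with an href inside any <li>, whether the list is a <ul> or an <ol>.
all_list_links = [a["href"] for a in soup.select("body li a[href]")]

# Only <a> elements whose <li> is nested somewhere inside a <ul>.
ul_links = [a["href"] for a in soup.select("body ul li a[href]")]

print(all_list_links)
print(ul_links)
```

Both selectors use the descendant combinator (a space), so they match at any nesting depth below the named ancestor rather than only direct children.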
    

    It is also worth noting that two of the script tags on the page contain a JSON structure with links, which may be useful depending on your needs.
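
One possible way to pull JSON out of script tags is sketched below. It runs on an inline sample, and the script type and JSON keys shown are assumptions for illustration only; the actual structure on dblp.org may differ, so inspect the page source first.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical markup: the script type and the JSON keys are assumptions,
# not the actual structure of dblp.org.
SAMPLE_HTML = """
<html><head>
  <script type="application/ld+json">
  {"@type": "WebSite", "url": "https://example.org/", "sameAs": ["https://example.org/about"]}
  </script>
  <script>console.log("not JSON");</script>
</head><body></body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

payloads = []
for script in soup.find_all("script"):
    text = script.string
    if not text:
        continue  # empty or externally loaded script
    try:
        payloads.append(json.loads(text))
    except json.JSONDecodeError:
        # Ordinary JavaScript is not valid JSON, so skip it.
        continue

print(payloads)
```

Trying `json.loads` on every script and skipping failures avoids depending on a particular `type` attribute, at the cost of parsing attempts on scripts that are plain JavaScript.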