python beautifulsoup web-crawler urllib

Web scraping of a research paper on the IEEE Xplore website using the BeautifulSoup and urllib Python libraries


I am trying to scrape the abstract of a research paper on the IEEE Xplore website (the URL appears in the code below). For this I used the urllib library and BeautifulSoup in Python 3.10.9. Below is the code I have used:

    from urllib.request import Request, urlopen
    from bs4 import BeautifulSoup

    url = 'https://ieeexplore.ieee.org/document/8480057'

    headers = {"User-Agent":'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'}

    req = Request(url, headers=headers)
    response = urlopen(req, timeout=10)
    html_text = response.read()

    soup = BeautifulSoup(html_text, "html.parser")
    # Find the element containing the abstract
    abstract_element = soup.find("div", class_="u-pb-1")
    # Extract the text from the abstract element
    abstract = abstract_element.text.strip()
    # Print the abstract
    print(abstract)


I have attached a screenshot of the part of the HTML that contains the abstract.

I am getting AttributeError: 'NoneType' object has no attribute 'text' on the line that assigns abstract.

The soup object itself is populated, but I don't know how to get the abstract text from it. I am new to web scraping. I have tried many things but have not been able to solve it. Please help me solve this problem. Thanks.


Solution

  • In this case it's actually possible to get the abstract without the need for Selenium. I generally use the requests library and CSS selectors in BeautifulSoup, and that's what I did here:

    import requests
    from bs4 import BeautifulSoup

    # url and headers are the same as in the question
    req = requests.get(url, headers=headers)
    soup = BeautifulSoup(req.text, "html.parser")
    

    and then simply:

    print(soup.select_one('meta[property="og:description"]')['content'])
    

    The output should be the abstract.
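
The AttributeError in the question arises because soup.find returned None: no div with class "u-pb-1" was present in the HTML that the server sent back, most likely because that part of the page is filled in by JavaScript after loading (an assumption, not something verified here). The same failure can happen with select_one if the meta tag is ever missing, so a more defensive version of the lookup might look like the sketch below. The function name extract_abstract and the sample HTML are made up for illustration; the sample stands in for the downloaded page so the sketch runs without network access.

```python
from typing import Optional

from bs4 import BeautifulSoup


def extract_abstract(html: str) -> Optional[str]:
    """Return the og:description content from html, or None if it is absent.

    extract_abstract is a hypothetical helper name, not part of any library.
    """
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one('meta[property="og:description"]')
    # select_one returns None when nothing matches, so check before indexing.
    if tag is None:
        return None
    content = tag.get("content")
    return content.strip() if content else None


# Made-up stand-in for the downloaded page (assumption: the real page
# carries the abstract in an og:description meta tag, as the answer reports).
sample_html = """
<html><head>
<meta property="og:description" content="  A made-up abstract used only for illustration.  ">
</head><body><div id="app"></div></body></html>
"""

print(extract_abstract(sample_html))
print(extract_abstract("<html><body></body></html>"))
```

With the live page, passing req.text from the snippet above to extract_abstract should give the abstract, or None instead of an exception if the tag is not there.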