How to iteratively retrieve the right information from beautiful soup elements?


I am trying to retrieve information from EZB press releases using BeautifulSoup. Because the HTML structure of the press releases has changed over time, no single selector finds the date in every file. I therefore tried try/except and if/else statements to cover the different layouts, but my code does not return the correct date for every press release.

Does anybody know how to iterate through multiple soup elements and choose the right element to select the date from the respective HTML file?

Here is my code:

import bs4, requests

pr_list = []

def parseContent(Urls):
  for x in Urls:
   res = requests.get(x)
   article = bs4.BeautifulSoup(res.text, 'html.parser')
   try:
    date = article.select('#main-wrapper > main > div.section > p.ecb-publicationDate')
    if date:
      for x in date:
        date = x.text.strip()   
    date = article.select('#main-wrapper > main > div.ecb-pressContentPubDate')
    if date:
      for x in date:
          date = x.text.strip()     
    else:
      date = article.select('#main-wrapper > main > div.title > ul > li.ecb-publicationDate')
      for x in date:
          date = x.text.strip()
   except:
    date = None
   try:
    title = article.select('#main-wrapper > main > div.title > h1')
    for x in title:
      title = x.text.strip()
   except:
    title = None
   try:
    body = article.select("#main-wrapper > main > div.section")
    for x in body:
      body = x.text.strip()
   except:
    body = None
   row = [date,title,body]
   pr_list.append(row)

Solution

  • Store your match expressions in a list and then iterate over them until one is successful:

    import bs4
    import requests
    
    
    date_expressions = [
        "#main-wrapper > main > div.section > p.ecb-publicationDate",
        "#main-wrapper > main > div.ecb-pressContentPubDate",
        "#main-wrapper > main > div.title > ul > li.ecb-publicationDate",
    ]
    
    title_expressions = [
        "#main-wrapper > main > div.title > h1",
    ]
    
    body_expressions = [
        "#main-wrapper > main > div.section",
    ]
    
    
    def try_several_expressions(article, expressions):
        """Try to match an element using the given list of expressions.
    
        Raise ValueError if we failed to find any matches or if we find
        multiple matches.
        """
    
        for expr in expressions:
            res = article.select(expr)
            if res:
                break
        else:
            raise ValueError("failed to match any expressions")
    
        if len(res) > 1:
            raise ValueError("failed to match a unique value")
    
        return res[0]
    
    
    def parseContent(urls):
        pr_list = []
        for url in urls:
            res = requests.get(url)
            article = bs4.BeautifulSoup(res.text, "html.parser")
            date = try_several_expressions(article, date_expressions).text
            title = try_several_expressions(article, title_expressions).text
            body = try_several_expressions(article, body_expressions).text
    
            row = [date, title, body]
            pr_list.append(row)
    
        return pr_list
    

    Assuming that you mean "ECB" rather than "EZB", I tested this against https://www.ecb.europa.eu/press/pr/date/2023/html/ecb.pr230710~77cf718c59.en.html and it seems to work as expected.
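
    To check the fallback logic without touching the network, here is a minimal offline sketch. The `sample_pages` markup is invented and only imitates three possible layouts (real ECB pages are much larger), so treat it as an illustration of the for/else pattern rather than a test against the live site.

```python
import bs4

# Selectors copied from the answer above.
date_expressions = [
    "#main-wrapper > main > div.section > p.ecb-publicationDate",
    "#main-wrapper > main > div.ecb-pressContentPubDate",
    "#main-wrapper > main > div.title > ul > li.ecb-publicationDate",
]

# Invented snippets (an assumption for illustration), one per layout.
sample_pages = {
    "layout_a": (
        '<div id="main-wrapper"><main><div class="section">'
        '<p class="ecb-publicationDate"> 10 July 2023 </p>'
        "</div></main></div>"
    ),
    "layout_b": (
        '<div id="main-wrapper"><main>'
        '<div class="ecb-pressContentPubDate">24 February 2020</div>'
        "</main></div>"
    ),
    "layout_c": (
        '<div id="main-wrapper"><main><div class="title"><ul>'
        '<li class="ecb-publicationDate">5 March 2015</li>'
        "</ul></div></main></div>"
    ),
}


def try_several_expressions(article, expressions):
    """Return the single element matched by the first expression that hits."""
    for expr in expressions:
        res = article.select(expr)
        if res:
            break
    else:
        # Runs only when the loop finished without reaching `break`.
        raise ValueError("failed to match any expressions")

    if len(res) > 1:
        raise ValueError("failed to match a unique value")

    return res[0]


dates = {}
for name, html in sample_pages.items():
    article = bs4.BeautifulSoup(html, "html.parser")
    dates[name] = try_several_expressions(article, date_expressions).text.strip()

print(dates)
```

    Each snippet is matched by a different entry in `date_expressions`, so the loop falls through to the second or third selector only when the earlier ones return an empty list.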


    If I make the one change I suggested in my comment (remove the if len(res) > 1 check), so that try_several_expressions looks like this:

    def try_several_expressions(article, expressions):
        """Try to match an element using the given list of expressions.
    
        Raise ValueError if we failed to find any matches. If an
        expression matches several elements, return the first one.
        """
    
        for expr in expressions:
            res = article.select(expr)
            if res:
                break
        else:
            raise ValueError("failed to match any expressions")
    
        # Always return the first matched element
        return res[0]
    

    Then the script works for every URL in your list except for https://www.ecb.europa.eu/press/pr/date/2020/html/ecb.pr2002242~8842dcb418.en.html, which doesn't have any content.

    If you put a try/except block in parseContent, you can simply ignore that failure:

    def parseContent(urls):
        pr_list = []
        for url in urls:
            res = requests.get(url)
            article = bs4.BeautifulSoup(res.text, "html.parser")
            try:
                date = try_several_expressions(article, date_expressions).text.strip()
                title = try_several_expressions(article, title_expressions).text.strip()
                body = try_several_expressions(article, body_expressions).text.strip()
            except ValueError:
                print(f'failed to parse: {url}')
                continue
    
            row = [date, title, body]
            pr_list.append(row)
    
        return pr_list
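
If you would rather keep a row with `None` in the missing field (as the question's code attempted) instead of skipping the whole page, one possible variant is sketched below. `first_text_or_none` is a hypothetical helper name, not part of BeautifulSoup, and the sample HTML is invented for illustration; the only library call it relies on is `select_one`, which returns `None` when nothing matches.

```python
import bs4

# Selectors copied from the answer above.
date_expressions = [
    "#main-wrapper > main > div.section > p.ecb-publicationDate",
    "#main-wrapper > main > div.ecb-pressContentPubDate",
    "#main-wrapper > main > div.title > ul > li.ecb-publicationDate",
]
title_expressions = ["#main-wrapper > main > div.title > h1"]
body_expressions = ["#main-wrapper > main > div.section"]


def first_text_or_none(article, expressions):
    """Return the stripped text of the first match, or None if nothing matches."""
    for expr in expressions:
        element = article.select_one(expr)  # None when the selector misses
        if element is not None:
            return element.text.strip()
    return None


# Invented page (an assumption) that has a title but no date and no body.
html = (
    '<div id="main-wrapper"><main><div class="title">'
    "<h1> Monetary policy decisions </h1>"
    "</div></main></div>"
)
article = bs4.BeautifulSoup(html, "html.parser")

row = [
    first_text_or_none(article, date_expressions),
    first_text_or_none(article, title_expressions),
    first_text_or_none(article, body_expressions),
]
print(row)
```

The trade-off is that missing fields no longer raise, so rows with `None` values need to be filtered or inspected afterwards.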