How to put mutually exclusive methods in a dictionary in Python?


I am writing a scraper to extract content from different sites. The user inputs a URL; my scraper parses it, works out which source it comes from (only a limited set of websites is supported), and extracts the content according to that website's DOM structure.

The easiest way looks like this:

def extract(soup, url):

  if url in siteA:
    content = soup.find_all('p')[0]
  elif url in siteB:
    content = soup.find_all('p')[3]
  elif url in siteC:
    content = soup.find_all('div', {'id':'ChapterBody'})[0]
  elif url in siteD:
    content = soup.find_all("td", {"class": "content"})[0]

However, this code becomes repetitive as more sites with different rules are added, so I would like to make it more compact and easier to maintain. Here is what I tried:

def extract(soup, url):

  support = {
            'siteA': soup.find_all('p')[0],
            'siteB': soup.find_all('p')[3],
            'siteC': soup.find_all('div', {'id':'ChapterBody'})[0],
            'siteD': soup.find_all("td", {"class": "content"})[0],
            }

  if url in support:
    content = support[url]

This way I only need to maintain a dictionary rather than keep appending branches. However, every value in the dictionary is evaluated when the code runs, and an IndexError is raised because some sites do not have a 'td' with class 'content' or a 'div' with id 'ChapterBody', so the siteC/siteD entries fail as soon as the dictionary is built.

What are some possible ways to solve this while keeping the code compact?


Solution

  • Convert the dictionary over to a dict of functions:

    support = {
              'siteA': lambda: soup.find_all('p')[0],
              'siteB': lambda: soup.find_all('p')[3],
              'siteC': lambda: soup.find_all('div', {'id':'ChapterBody'})[0],
              'siteD': lambda: soup.find_all("td", {"class": "content"})[0]
              }
    

    Now they don't execute until you call the function:

    if url in support:
        content = support[url]()
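
The difference between the eager and the lazy dictionary can be seen without BeautifulSoup at all. The following is only a minimal sketch: `paragraphs` is a hypothetical stand-in for the result of `soup.find_all('p')` on a page that has a single paragraph, and the site keys are the placeholder names from the question.

```python
# Hypothetical stand-in for soup.find_all('p') on a page with a single <p>.
paragraphs = ["first paragraph"]

# Eager: both values are computed while the dict literal is built,
# so the out-of-range index fails even though 'siteB' is never looked up.
try:
    eager = {
        "siteA": paragraphs[0],
        "siteB": paragraphs[3],
    }
    eager_error = None
except IndexError as exc:
    eager_error = exc

# Lazy: each value is a function; nothing is indexed until it is called.
lazy = {
    "siteA": lambda: paragraphs[0],
    "siteB": lambda: paragraphs[3],
}
content = lazy["siteA"]()  # only the siteA rule runs

print(type(eager_error).__name__)
print(content)
```

Building `lazy` never touches index 3, so a rule that does not fit the current page can only fail if that rule is actually called.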
    

    Alternatively, pulling out the soup.find_all() call and having a dictionary of tuples (param, index) is also an option:

    support = {
              'siteA': (('p',), 0),
              'siteB': (('p',), 3),
              'siteC': (('div', {'id':'ChapterBody'}), 0),
              'siteD': (("td", {"class": "content"}), 0)
              }
    
    if url in support:
        param, index = support[url]
        content = soup.find_all(*param)[index]
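
Note that `url in support` only matches when a dictionary key equals the whole URL. One possible way to bridge that, assuming each site can be identified by its hostname, is to key the rules on the hostname and look it up with `urllib.parse.urlparse`. This is a sketch rather than a definitive implementation: the hostnames, the `RULES` name and the index guard are all hypothetical additions, and `soup` can be any object with a `find_all` method, such as a BeautifulSoup document.

```python
from urllib.parse import urlparse

# Hypothetical hostnames; each rule is (find_all arguments, index to take).
RULES = {
    "site-a.example": (("p",), 0),
    "site-b.example": (("p",), 3),
    "site-c.example": (("div", {"id": "ChapterBody"}), 0),
    "site-d.example": (("td", {"class": "content"}), 0),
}


def extract(soup, url):
    """Return the content element for a supported URL, or None."""
    host = urlparse(url).hostname  # e.g. 'site-a.example'
    rule = RULES.get(host)
    if rule is None:
        return None  # unsupported site
    args, index = rule
    matches = soup.find_all(*args)
    # Guard the index so a page with an unexpected layout returns None
    # instead of raising IndexError.
    return matches[index] if index < len(matches) else None
```

Adding a site then means adding one entry to `RULES`; the function body does not change.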