
IndexError: list index out of range - How to skip a broken URL?


How can I tell my program to skip broken / non-existent URLs and continue with the task? Every time I run this, it stops as soon as it encounters a URL that doesn't exist and raises: IndexError: list index out of range.

The range covers URLs 1 to 450, but some pages in the mix are broken (for example, URL 133 doesn't exist).

import requests
import pandas as pd
import json
from pandas.io.json import json_normalize
from bs4 import BeautifulSoup

df = pd.DataFrame()

for id in range(1, 450):
    url = f"https://liiga.fi/api/v1/shotmap/2022/{id}"
    res = requests.get(url)
    soup = BeautifulSoup(res.content, "lxml")
    s = soup.select('html')[0].text.strip('jQuery1720724027235122559_1542743885014(').strip(')')
    s = s.replace('null','"placeholder"')
    data = json.loads(s)
    data = json_normalize(data)
    matsit = pd.DataFrame(data)
    df = pd.concat([df, matsit], axis=0)


df.to_csv("matsit.csv", index=False)

Solution

  • I would assume your IndexError comes from the following line:

    s = soup.select('html')[0].text.strip('jQuery1720724027235122559_1542743885014(').strip(')')
    

    You could solve it like this:

    try:
        s = soup.select('html')[0].text.strip('jQuery1720724027235122559_1542743885014(').strip(')')
    except IndexError as IE:
        print(f"IndexError: {IE}")
        continue
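
    Note that for the continue to work, this try/except has to sit inside the body of the for loop. As a complementary guard (just a sketch, on the assumption that a non-existent id comes back with a non-200 HTTP status, which is presumably why there is nothing to parse), you could also skip bad responses before parsing:

    res = requests.get(url)
    # skip ids that don't exist; a missing page presumably returns 404
    if res.status_code != 200:
        print(f"Skipping id {id}: HTTP {res.status_code}")
        continue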
    

    If the error does not occur on that line, just catch the exception on whichever line the IndexError is actually occurring. Alternatively, you can catch all exceptions with

    try:
        code_where_exception_occurs
    except Exception as e:
        print(f"Exception: {e}")
        continue
    

    but I would recommend being as specific as possible, so that you handle each expected error in the appropriate way. In the example above, replace code_where_exception_occurs with the code that raises the exception. You could also put the try/except clause around the whole block of code inside the for loop, though it is best to catch each exception individually. This should also work:

    try:
        url = f"https://liiga.fi/api/v1/shotmap/2022/{id}"
        res = requests.get(url)
        soup = BeautifulSoup(res.content, "lxml")
        s = soup.select('html')[0].text.strip('jQuery1720724027235122559_1542743885014(').strip(')')
        s = s.replace('null','"placeholder"')
        data = json.loads(s)
        data = json_normalize(data)
        matsit = pd.DataFrame(data)
        df = pd.concat([df, matsit], axis=0)
    except Exception as e:
        print(f"Exception: {e}")
        continue
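
    Putting these ideas together: since the endpoint serves JSON, you could let requests decode the body directly instead of parsing it out of an HTML soup. This is only a sketch and it assumes the API returns plain JSON; the jQuery1720724027235122559_1542743885014( prefix you strip off suggests a JSONP wrapper, so check what the response actually looks like first:

    import pandas as pd
    import requests

    frames = []
    for game_id in range(1, 450):  # renamed from id to avoid shadowing the builtin
        url = f"https://liiga.fi/api/v1/shotmap/2022/{game_id}"
        res = requests.get(url)
        if res.status_code != 200:  # broken / non-existent page
            print(f"Skipping id {game_id}: HTTP {res.status_code}")
            continue
        try:
            data = res.json()  # decode the JSON body directly
        except ValueError as e:  # body was not valid JSON
            print(f"Skipping id {game_id}: {e}")
            continue
        frames.append(pd.json_normalize(data))

    if frames:  # avoid concatenating an empty list
        df = pd.concat(frames, ignore_index=True)
        df.to_csv("matsit.csv", index=False)

    Collecting the frames in a list and concatenating once at the end is also noticeably faster than calling pd.concat on every iteration.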