
Python Web Scraping: How to use Try/Except to handle missing values


I am trying to use try/except to handle potentially missing values as I scrape a list of URLs containing restaurant data. Each list needs to be the same length so that I can build a pandas DataFrame from them.

I'd like missing values to be coded as None or some other recognizable form. Currently, the websites list has length 71, while the others have length 76. The error is: ValueError: arrays must all be same length.

Scraping code (see Try/Except part):

import requests
from bs4 import BeautifulSoup

# Initialize lists
names = []
addresses = []
zip_codes = []
websites = []

# Scrape through list of urls
for link in url_list:
    r = requests.get(link).text
    soup = BeautifulSoup(r, 'lxml')

    place_name = soup.find('h1').text
    names.append(place_name)

    place_data = soup.find('h6')

    place_address = place_data.text.split(',')[0]
    addresses.append(place_address)

    place_zip = place_data.text.split(',')[1][1:5]
    zip_codes.append(place_zip)

    # Replace missing value with None
    try:
        place_web = place_data.a['href']
        websites.append(place_web)
    except Exception as e:
        place_web = None

The error occurs when I create the DataFrame like so:

restaurant_data = pd.DataFrame({'name' : names, 
                                'address' : addresses, 
                                'zip_code' : zip_codes,
                                'website' : websites})

I also tried changing None to a string like 'NA', but the error persisted. I don't want to keep sending GET requests endlessly. Does anyone have an idea how to fix this? Thanks.


Solution

  • From your description, the problem is that websites receives fewer items than the other lists. The append sits inside the try block, so whenever place_data.a['href'] raises, the except branch assigns place_web = None but never appends it. Five failed lookups would leave websites at 71 items against 76 for the others.

    Append the placeholder in the except branch as well, so that every iteration adds exactly one item:

    try:
        place_web = place_data.a['href']
        websites.append(place_web)
    except Exception as e:
        place_web = None
        websites.append(place_web)
    

    If each restaurant could have several links, you could instead use a list initializer such as

    [None] * 5
    

    which creates a list that is just

    [None, None, None, None, None]
    

    but appending that only makes sense if every successful lookup also returns a list of five links. Is that the case? In your code place_data.a['href'] returns a single string, so a single None per missing site is what keeps all four lists the same length.
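
One way to make the "exactly one append per iteration" rule hard to break is to move the lookup into a small helper that always returns a value. The following is a minimal sketch, not the asker's actual code: extract_website is a hypothetical helper name, and the SimpleNamespace objects are stand-ins that only mimic how a parsed h6 tag behaves (a missing link gives None for .a, a link without an href raises KeyError), so the example runs without any network access or third-party packages.

```python
from types import SimpleNamespace


def extract_website(place_data):
    """Return the href of the first link in place_data, or None if unavailable.

    Hypothetical helper; the name is not from the original post.
    """
    try:
        return place_data.a['href']
    except (AttributeError, TypeError, KeyError):
        # AttributeError: place_data itself is None (no <h6> was found)
        # TypeError:      place_data.a is None (no <a> inside the <h6>)
        # KeyError:       the <a> exists but has no href attribute
        return None


# Stand-ins for parsed <h6> tags (made-up data, not real scrape results).
pages = [
    SimpleNamespace(a={'href': 'https://example.com'}),  # link present
    SimpleNamespace(a=None),                             # no <a> tag
    SimpleNamespace(a={}),                               # <a> without href
    None,                                                # no <h6> at all
]

names = []
websites = []

for i, place_data in enumerate(pages):
    names.append(f'place {i}')
    # The append happens outside any try block, so it runs on every iteration.
    websites.append(extract_website(place_data))

# Both lists grew in lockstep, which is what pd.DataFrame requires.
assert len(names) == len(websites)
print(websites)
```

Catching only the three exception types that a missing link can produce, rather than a bare Exception, means unrelated problems such as a failed request still surface instead of silently becoming None.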