Web scraping project: How to handle "AttributeError: 'NoneType' object has no attribute 'find'"


First time posting. I am learning Python to do a web scraping project for work. I am trying to collect information on the projects an organisation shares on its website (my company has asked their permission, so that is all good). The code ran with no issues when scraping their HPV projects (52 in total), but when I try to scrape their HIV projects (131 in total) I run into the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_9556/1476973876.py in <module>
     11     project_description = soup.find('div', class_="column-1").text
     12     project_details = soup.find(class_="block-details")
---> 13     project_number = project_details.find("strong").text
     14     project_start = project_details.find("span", class_="bar-start").text
     15     project_end = project_details.find("span", class_="bar-end").text

AttributeError: 'NoneType' object has no attribute 'find'

When scraping a list of just the first 10 URLs, it works fine. I believe that the problem might be that one of the links doesn't have the "strong" text. If so, how can I identify which link is not working?

Here is my code (sorry if it is messy; I would appreciate tips on how to improve it):

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time


st = time.time()
URLs = ['https://www.zonmw.nl/nl/onderzoek-resultaten/preventie/gezonde-wijk-en-omgeving/programmas/project-detail/preventieprogramma-4/hiv-self-testing-combined-with-internet-counselling-a-low-threshold-strategy-to-increase-diagnoses/', 'https://www.zonmw.nl/nl/onderzoek-resultaten/fundamenteel-onderzoek/programmas/project-detail/vici/training-b-cells-to-generate-broadly-neutralizing-hiv-antibodies/', 'https://www.zonmw.nl/nl/over-zonmw/e-health-en-ict-in-de-zorg/programmas/project-detail/diseasemanagement-chronische-ziekten/comorbidity-and-aging-with-hiv/', 'https://www.zonmw.nl/nl/onderzoek-resultaten/geneesmiddelen/programmas/project-detail/kennisbeleid-kwaliteit-curatieve-zorg/een-multidisiplinaire-richtlijn-voor-arbeidsgerelateerde-problematiek-bij-mensen-met-hiv/', 'https://www.zonmw.nl/nl/onderzoek-resultaten/doelmatigheidsonderzoek/programmas/project-detail/goed-gebruik-geneesmiddelen/study-to-optimize-antiretroviral-regimens-in-hiv-infected-women-who-want-to-breastfeed-panna-b/', 'https://www.zonmw.nl/nl/over-zonmw/e-health-en-ict-in-de-zorg/programmas/project-detail/active-and-assisted-living-aal2/u-topia-towards-empowering-older-persons-living-with-hiv/', ...]
data = []


for URL in URLs:
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    
    project_title = soup.find('h1').text
    project_description = soup.find('div', class_="column-1").text
    project_details = soup.find(class_="block-details")
    project_number = project_details.find("strong").text
    project_start = project_details.find("span", class_="bar-start").text
    project_end = project_details.find("span", class_="bar-end").text
    project_program = project_details.find("ul").text
    
    
    for node in project_details.find_all("p"):
        keywords = node.text.split(', ')
    project_recipient = keywords[-1]
    
    data.append((project_title, project_description, project_number, project_start, project_end, project_program, project_recipient))

et = time.time()

elapsed_time = et - st
print('Execution time:', elapsed_time, 'seconds')

Thank you so much!


Solution

  • How can I identify which link is not working?

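    1. Check the response's status code to see whether the link works: page.status_code == 200.
    2. The traceback says the 'NoneType' object is project_details, so on the failing page soup.find(class_="block-details") found nothing. Check whether that page uses a different class for the details block (or for the "strong" text) and adjust your selector for that specific link.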
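As a rough sketch of those two checks, the helper below walks a list of URLs and records why each problematic one fails. The name find_problem_urls and the fetch/parse parameters are hypothetical, introduced only for illustration; fetch is assumed to return an object with status_code and content (as requests.get does), and parse is assumed to return an object with a BeautifulSoup-style find().

```python
def find_problem_urls(urls, fetch, parse):
    """Return (url, reason) pairs for pages that cannot be scraped as expected.

    fetch(url)     -> response object with .status_code and .content
    parse(content) -> parsed document with a BeautifulSoup-style .find()
    """
    problems = []
    for url in urls:
        response = fetch(url)
        # Check 1: did the server return the page at all?
        if response.status_code != 200:
            problems.append((url, f"HTTP {response.status_code}"))
            continue

        soup = parse(response.content)
        # Check 2: is the details block present? find() returns None if not.
        details = soup.find(class_="block-details")
        if details is None:
            problems.append((url, "no 'block-details' element"))
        elif details.find("strong") is None:
            problems.append((url, "no <strong> inside 'block-details'"))
    return problems
```

With the imports from the question, this could be called as find_problem_urls(URLs, requests.get, lambda html: BeautifulSoup(html, 'html.parser')) and the result printed to see which links need a different selector.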

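    See the sample below for validation and an improvement to the code.

    First check whether the element was found, and only then use .text. Also initialize the variables with empty defaults inside the for loop, so that each row still has a value when an element is not found. Note that it is the check, not the initialization, that prevents the AttributeError.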

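    for URL in URLs:
        page = requests.get(URL)
        soup = BeautifulSoup(page.content, 'html.parser')
        
        # Initialize the variables; they keep the default '' if the element is missing.
        project_number = project_start = ''
        
        # The details block itself can be missing (find() returns None):
        # report the URL so you know which link fails, then skip it.
        project_details = soup.find(class_="block-details")
        if project_details is None:
            print('No block-details element:', URL)
            continue
        
        # Look the elements up first. [Do not call .text directly on the result of find()]
        project_number_el = project_details.find("strong")
        project_start_el = project_details.find("span", class_="bar-start")
        
        # Apply .text only if the element was found; otherwise it raises an AttributeError.
        if project_number_el is not None:
            project_number = project_number_el.text
        if project_start_el is not None:
            project_start = project_start_el.text
            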
        # Append the data
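The check-before-.text pattern in the sample above could also be wrapped in one small helper so that each field needs only a single line. This is a minimal sketch, not part of BeautifulSoup: the name text_or_default and its default parameter are hypothetical, and it assumes the parent object offers a BeautifulSoup-style find() whose result has a .text attribute. Unlike the sample above, it also strips surrounding whitespace.

```python
def text_or_default(parent, *find_args, default="", **find_kwargs):
    """Return the stripped text of parent.find(...), or default if anything is missing."""
    # The parent itself may be None, e.g. when the details block was not found.
    if parent is None:
        return default
    node = parent.find(*find_args, **find_kwargs)
    # find() returns None when no element matches.
    return node.text.strip() if node is not None else default


# Possible use inside the loop from the question (names taken from that code):
#   project_details = soup.find(class_="block-details")
#   project_number = text_or_default(project_details, "strong")
#   project_start = text_or_default(project_details, "span", class_="bar-start")
```

Because a missing element silently becomes the default value, it may still be worth logging the URL whenever project_details is None, so the failing links remain easy to spot.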