First time posting. I am learning Python to do a web scraping project for my work. I am trying to collect information on the different projects this organisation shares on their website (my company has asked them for permission, so that is all good). I managed to run the code with no issues when scraping their HPV projects (52 in total), but when trying to scrape their HIV projects (131 in total) I am running into the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_9556/1476973876.py in <module>
11 project_description = soup.find('div', class_="column-1").text
12 project_details = soup.find(class_="block-details")
---> 13 project_number = project_details.find("strong").text
14 project_start = project_details.find("span", class_="bar-start").text
15 project_end = project_details.find("span", class_="bar-end").text
AttributeError: 'NoneType' object has no attribute 'find'
When scraping a list of just the first 10 URLs, it works fine. I believe that the problem might be that one of the links doesn't have the "strong" text. If so, how can I identify which link is not working?
Here is my code (sorry if it is messy, would appreciate tips on how to improve)
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
st = time.time()
URLs = ['https://www.zonmw.nl/nl/onderzoek-resultaten/preventie/gezonde-wijk-en-omgeving/programmas/project-detail/preventieprogramma-4/hiv-self-testing-combined-with-internet-counselling-a-low-threshold-strategy-to-increase-diagnoses/', 'https://www.zonmw.nl/nl/onderzoek-resultaten/fundamenteel-onderzoek/programmas/project-detail/vici/training-b-cells-to-generate-broadly-neutralizing-hiv-antibodies/', 'https://www.zonmw.nl/nl/over-zonmw/e-health-en-ict-in-de-zorg/programmas/project-detail/diseasemanagement-chronische-ziekten/comorbidity-and-aging-with-hiv/', 'https://www.zonmw.nl/nl/onderzoek-resultaten/geneesmiddelen/programmas/project-detail/kennisbeleid-kwaliteit-curatieve-zorg/een-multidisiplinaire-richtlijn-voor-arbeidsgerelateerde-problematiek-bij-mensen-met-hiv/', 'https://www.zonmw.nl/nl/onderzoek-resultaten/doelmatigheidsonderzoek/programmas/project-detail/goed-gebruik-geneesmiddelen/study-to-optimize-antiretroviral-regimens-in-hiv-infected-women-who-want-to-breastfeed-panna-b/', 'https://www.zonmw.nl/nl/over-zonmw/e-health-en-ict-in-de-zorg/programmas/project-detail/active-and-assisted-living-aal2/u-topia-towards-empowering-older-persons-living-with-hiv/', ...]
data = []
for URL in URLs:
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    project_title = soup.find('h1').text
    project_description = soup.find('div', class_="column-1").text
    project_details = soup.find(class_="block-details")
    project_number = project_details.find("strong").text
    project_start = project_details.find("span", class_="bar-start").text
    project_end = project_details.find("span", class_="bar-end").text
    project_program = project_details.find("ul").text
    for node in project_details.find_all("p"):
        keywords = node.text.split(', ')
        project_recipient = keywords[-1]
    data.append((project_title, project_description, project_number, project_start, project_end, project_program, project_recipient))
et = time.time()
elapsed_time = et - st
print('Execution time:', elapsed_time, 'seconds')
Thank you so much!
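How can I identify which link is not working?
Check the response status for each URL: a page that loaded correctly satisfies page.status_code == 200. Note also what the traceback says: 'NoneType' object has no attribute 'find' means that project_details itself is None, so soup.find(class_="block-details") found nothing on that page. Printing the URL whenever that lookup returns None tells you which link is the problem.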
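As a rough sketch of that check, the helper below reports, for each URL, either a non-200 status or which of the elements the scraper relies on are absent. The names REQUIRED, missing_elements, report_problems and the fetch callable are made up for this illustration; the tag and class names are the ones from the question's code.

```python
from bs4 import BeautifulSoup

# Elements the scraper relies on: (label, tag name or None, CSS class or None).
# Tag and class names are taken from the question's code.
REQUIRED = [
    ("title", "h1", None),
    ("description", "div", "column-1"),
    ("details", None, "block-details"),
]


def missing_elements(html):
    """Return the labels of the required elements that are absent from html."""
    soup = BeautifulSoup(html, "html.parser")
    missing = []
    for label, tag, css_class in REQUIRED:
        kwargs = {"class_": css_class} if css_class else {}
        if soup.find(tag, **kwargs) is None:
            missing.append(label)
    return missing


def report_problems(urls, fetch):
    """Map each problematic URL to a short description of what went wrong.

    fetch is a hypothetical callable: fetch(url) -> (status_code, html_text).
    """
    problems = {}
    for url in urls:
        status, html = fetch(url)
        if status != 200:
            problems[url] = f"HTTP {status}"
            continue
        missing = missing_elements(html)
        if missing:
            problems[url] = "missing: " + ", ".join(missing)
    return problems


# Possible use with requests (not executed here):
#     def fetch(url):
#         page = requests.get(url, timeout=30)
#         return page.status_code, page.text
#     print(report_problems(URLs, fetch))
```

Passing fetch in as an argument keeps the parsing logic separate from the network call, so the check can be tried on saved HTML before running it against all 131 pages.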
See the sample below for validation and an improved version of the code.
First check whether the element (tag or class) was found, and only then call .text on it. Also initialize the variables to empty strings inside the for loop, so that a row keeps a blank default when an element is missing instead of raising an AttributeError.
for URL in URLs:
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    # Initialize the variables; they stay blank if the element is not found.
    project_number = project_start = ''
    # Look up the details block first; it is None on pages that lack it.
    project_details = soup.find(class_="block-details")
    if project_details is None:
        print('No block-details found (status', page.status_code, '):', URL)
    else:
        # Find the elements without calling .text on the result directly.
        project_number_tag = project_details.find("strong")
        project_start_tag = project_details.find("span", class_="bar-start")
        # Apply .text only if the element was found; otherwise it raises an error.
        if project_number_tag:
            project_number = project_number_tag.text
        if project_start_tag:
            project_start = project_start_tag.text
    # Append the data
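If the same "find, then check, then .text" pattern is needed for every field, it can be folded into one small helper. This is only a sketch: safe_text is a hypothetical name, and the sample HTML and project number below are invented for the demonstration.

```python
from bs4 import BeautifulSoup


def safe_text(parent, *args, default="", **kwargs):
    """Return the stripped text of parent.find(...), or default if missing.

    Works even when parent itself is None, which is the case that
    triggers the AttributeError in the question.
    """
    if parent is None:
        return default
    node = parent.find(*args, **kwargs)
    return node.get_text(strip=True) if node is not None else default


# Invented sample page: it has a project number but no start date.
html = '<div class="block-details"><strong>12345</strong></div>'
soup = BeautifulSoup(html, "html.parser")
project_details = soup.find(class_="block-details")

project_number = safe_text(project_details, "strong")
project_start = safe_text(project_details, "span", class_="bar-start")
print(repr(project_number), repr(project_start))
```

With this helper each field becomes a single line, and a page without the details block yields blank values rather than stopping the loop.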