python, text, web-scraping, beautifulsoup, pubmed

Python - web scraping pubmed.gov abstracts w/ BeautifulSoup - getting nonetype error


I am scraping abstracts from pubmed.gov, and it works for the most part, except for pages whose abstracts have no text. I tried an if statement, but I'm clearly doing something wrong. How can I make the script skip URLs that have no abstract text? I've included a URL where this happens.

I'm getting this error: AttributeError: 'NoneType' object has no attribute 'find'

Thanks in advance!

import requests
from bs4 import BeautifulSoup

listofa_urls = ['https://www.ncbi.nlm.nih.gov/pubmed/31103571']

for th in listofa_urls:

    response = requests.get(th)
    soup = BeautifulSoup(response.content, 'html.parser')

    if (soup.find(class_='abstr').find('div') is not None):
       div_ = soup.find(class_='abstr').find('div')
       if div_.find('h4'):
           h4_ = div_.find_all('h4')
           p_ = div_.find_all('p')
       else:
           h4_ = soup.find(class_='abstr').find_all('h3')
           p_ = soup.find(class_='abstr').find_all('p')

       mp = list(map(lambda x, y: [x.get_text(),y.get_text()], h4_, p_))
       print(mp)

Solution

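  • As stated in the comments, you cannot call .find() on None. soup.find(class_='abstr') returns None when no element with that class exists, so the chained .find('div') raises the AttributeError before the "is not None" test is ever evaluated. Check whether the first find returns anything instead.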

    Just remove the second find:

    if (soup.find(class_='abstr').find('div') is not None):
    

    Becomes

    if (soup.find(class_='abstr') is not None):
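
To see the guard in context, here is a minimal sketch rather than a definitive implementation. It assumes the same 'abstr' class and h3/h4/p layout as the question's code. The extract_abstract helper and the two HTML snippets are hypothetical stand-ins, so the example runs without a network request. It also guards the inner div lookup, since .find('div') can return None too.

```python
from bs4 import BeautifulSoup


def extract_abstract(html):
    """Return [heading, paragraph] pairs, or [] when the page has no abstract.

    Hypothetical helper: mirrors the question's logic with None checks added.
    """
    soup = BeautifulSoup(html, 'html.parser')

    # Look the container up once and test it before chaining further calls.
    abstr = soup.find(class_='abstr')
    if abstr is None:
        return []

    # The inner div may be missing as well, so test it before calling .find().
    div_ = abstr.find('div')
    if div_ is not None and div_.find('h4'):
        headings = div_.find_all('h4')
        paragraphs = div_.find_all('p')
    else:
        headings = abstr.find_all('h3')
        paragraphs = abstr.find_all('p')

    # zip stops at the shorter list, like map() with two iterables in Python 3.
    return [[h.get_text(), p.get_text()] for h, p in zip(headings, paragraphs)]


# Made-up snippets standing in for downloaded pages (assumed structure).
with_abstract = '<div class="abstr"><h3>Abstract</h3><div><p>Some text.</p></div></div>'
without_abstract = '<div class="rprt"><h1>Title only</h1></div>'

print(extract_abstract(with_abstract))
print(extract_abstract(without_abstract))
```

In the original loop, the same idea amounts to storing soup.find(class_='abstr') in a variable, using continue when it is None, and reusing the variable instead of repeating the lookup.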