Tags: python, html, beautifulsoup, paragraph

Scraping random Wikipedia articles with Beautiful Soup works for about 1000 iterations until I get an AttributeError


Code I have used in a Jupyter Notebook:

import requests
from bs4 import BeautifulSoup

corpus = ""

for x in range(10000):
    URL = "https://en.wikipedia.org/wiki/Special:Random"
    page = requests.get(URL)
    html = page.text

    soup = BeautifulSoup(html)
    text = soup.p.text

    # strip citation markers [1] through [9]
    for i in range(1, 10):
        text = text.replace(f'[{i}]', '')
    text = text.strip()
    corpus += text
    print(x)

with open('Wikipedia Corpus.txt', 'w') as f:
    f.write(corpus)

Error I get:

AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_8985/763917129.py in <module>
     11 
     12     soup = BeautifulSoup(html)
---> 13     text = soup.p.text
     14 
     15     text = text.replace('[1]', '')

AttributeError: 'NoneType' object has no attribute 'text'

Could this error have been caused by a temporary internet disconnection? I do not know why this code stops working after about 1000 iterations.


Solution

  • I'd reckon that means whichever page you had just tried to scrape either had a paragraph tag with no text or contained no paragraph at all. Wrap the scrape in a try/except so that, when the error occurs, you can print the URL of the offending page (see the sketch below); with that you can inspect the HTML source and see what is causing the problem. Beyond that, there is not much any of us can do to help, since you are accessing random Wikipedia articles: each page is formatted differently, and some Wikipedia pages don't contain much, if any, data.
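A minimal sketch of that try/except approach, assuming the same loop as the question. The 'html.parser' argument and the page.url attribute are standard Beautiful Soup and requests features; the exact error handling shown here is one possible choice, not the only one:

import requests
from bs4 import BeautifulSoup

corpus = ""
URL = "https://en.wikipedia.org/wiki/Special:Random"

for x in range(10000):
    page = requests.get(URL)
    soup = BeautifulSoup(page.text, 'html.parser')
    try:
        text = soup.p.text
    except AttributeError:
        # soup.p returned None: this article has no <p> tag.
        # requests follows the redirect, so page.url is the
        # resolved article URL, not Special:Random.
        print('No paragraph found on', page.url)
        continue

    # strip citation markers [1] through [9]
    for i in range(1, 10):
        text = text.replace(f'[{i}]', '')
    corpus += text.strip()
    print(x)

with open('Wikipedia Corpus.txt', 'w') as f:
    f.write(corpus)

Catching AttributeError specifically, rather than using a bare except:, keeps genuine network failures visible. An equivalent alternative is to call soup.find('p') and skip the page when it returns None.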