Code I have used in a Jupyter Notebook:
import requests
from bs4 import BeautifulSoup

corpus = ""
for x in range(10000):
    # fetch a random article
    URL = "https://en.wikipedia.org/wiki/Special:Random"
    page = requests.get(URL)
    html = page.text
    soup = BeautifulSoup(html)
    text = soup.p.text
    # strip citation markers like [1], [2], ...
    text = text.replace('[1]', '')
    text = text.replace('[2]', '')
    text = text.replace('[3]', '')
    text = text.replace('[4]', '')
    text = text.replace('[5]', '')
    text = text.replace('[6]', '')
    text = text.replace('[7]', '')
    text = text.replace('[8]', '')
    text = text.replace('[9]', '')
    text = text.strip()
    corpus += text
    print(x)

with open('Wikipedia Corpus.txt', 'w') as f:
    f.write(corpus)
Error I get:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_8985/763917129.py in <module>
     11
     12     soup = BeautifulSoup(html)
---> 13     text = soup.p.text
     14
     15     text = text.replace('[1]', '')

AttributeError: 'NoneType' object has no attribute 'text'
Could this error have been caused by a temporary internet disconnection? I do not know why this code stops working after about 1000 iterations.
I'd reckon that means whichever page you were scraping at that moment had a paragraph tag with no value, or didn't contain a paragraph tag at all, so soup.p came back as None. Wrap that line in a try/except so that when you hit the error you can print the URL of the page (see the sketch below); with that you can look at the HTML source and see what is causing the problem. Beyond that there is not much any of us can do to help, since you are accessing random Wikipedia articles: every page is formatted differently, and some Wikipedia pages don't contain much data, if any.
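A minimal sketch of that pattern, assuming you just want to log the bad pages and keep collecting text. Note that page.url holds the final article address after the Special:Random redirect (requests follows redirects by default), so that is the URL worth printing:

import requests
from bs4 import BeautifulSoup

corpus = ""
for x in range(10000):
    page = requests.get("https://en.wikipedia.org/wiki/Special:Random")
    soup = BeautifulSoup(page.text)
    try:
        text = soup.p.text
    except AttributeError:
        # soup.p was None: this page has no <p> tag, so log the
        # resolved article URL and skip it
        print("no paragraph at:", page.url)
        continue
    corpus += text.strip()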