I am working on a project that involves extracting data from a website. Specifically, I am interested in pulling out the name of each category along with its description.
I have considered using web scraping libraries like BeautifulSoup in Python, but I am not sure how to navigate through each category link to get the required information.
The website has multiple category names listed, and each category has its own page with parameters and descriptions. I am not sure how to programmatically "click" on each link to scrape the data. Here is what I have so far:
import requests
from bs4 import BeautifulSoup
URL = "https://docs.derivative.ca/Category:CHOPs"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="mw-pages")
print(results.prettify())
chop_elements = results.find_all("div", class_="mw-content-ltr")
for chop_element in chop_elements:
    print(chop_element, end="\n" * 2)

This prints list items like:
<li><a href="/Analyze_CHOP" title="Analyze CHOP">Analyze CHOP</a></li>
<li><a href="/Angle_CHOP" title="Angle CHOP">Angle CHOP</a></li>
<li><a href="/Attribute_CHOP" title="Attribute CHOP">Attribute CHOP</a></li>
<li><a href="/Audio_Band_EQ_CHOP" title="Audio Band EQ CHOP">Audio Band EQ CHOP</a></li>
The website is https://docs.derivative.ca/Category:CHOPs.
What is the best way to navigate through each category link and extract the required data? I'm not entirely sure whether I've inspected the HTML structure correctly, so I'm looking for guidance on how to approach this problem.
You could request the page from each link and get the first line of the summary; something like:
import requests
from bs4 import BeautifulSoup

rootUrl = 'https://docs.derivative.ca'
req = requests.get(rootUrl + '/Category:CHOPs')
req.raise_for_status()  # stop early if the category page fails to load

## getting the links
soup = BeautifulSoup(req.content, 'html.parser')

# each alphabetical group is a div.mw-category-group containing
# an h3 heading followed by a ul of links
groups = soup.select('div.mw-category-group:has(h3~ul)')
chops = [{
    'group': g.h3.get_text(strip=True),
    'category': a.get_text(strip=True),
    'link': rootUrl + a['href']
} for g in groups for a in g.select('li>a[href]')]
cLen = len(chops)
print('found', cLen, 'categories with links')

## getting the descriptions
for i, c in enumerate(chops):
    # print(f'scraping {i+1} of {cLen}: {c["category"]!r} from {c["link"]}')
    cReq = requests.get(c['link'])
    try:
        cReq.raise_for_status()
    except requests.HTTPError:
        continue  # skip pages that fail to load
    cSoup = BeautifulSoup(cReq.content, 'html.parser')
    # first paragraph after the heading that contains the "Summary" anchor
    summary_p1 = cSoup.select_one('h2:has(span#Summary)~p')
    if summary_p1:
        c['description'] = summary_p1.get_text(strip=True)
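Once the loop finishes, chops is a list of dicts, one per CHOP, with group, category, link and (where a Summary section was found) description keys. As a minimal sketch, assuming you want to persist the results (the filename chops.json is just an example), you could dump them to JSON:

import json

# write the scraped list of dicts to disk; 'chops.json' is an example filename
with open('chops.json', 'w', encoding='utf-8') as f:
    json.dump(chops, f, ensure_ascii=False, indent=2)

If you end up scraping many pages, it is also polite to pause between requests, e.g. time.sleep(0.5) inside the loop, so you don't hammer the wiki.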