Hope you're all well. I wrote a basic web scrape of an HTML site earlier today along similar lines. I was following a tutorial, and as you'll be able to see from my code, I'm a bit of a greenhorn when it comes to Python. Hoping for a bit of guidance on scraping this site.
As you can see from the commented-out line,
#print(results.prettify())
I am able to print out the entire contents of the webpage. What I'd like to do, however, is whittle down what I'm printing so that only the relevant content comes out. There is a lot of content on the page that I don't want, and I'd like to massage it out. Does anyone have any thoughts on why the for loop at the bottom of the code is not sequentially grabbing the paragraphs inside the element with the xlmins attribute and printing them out? Please see the code below.
import requests
from bs4 import BeautifulSoup
URL = "http://www.gutenberg.org/files/7142/7142-h/7142-h.htm"
page = requests.get(URL)
#create a BeautifulSoup object to parse the page content
soup = BeautifulSoup(page.content, 'html.parser')
#this line grabs the element carrying the given xmlns attribute
results = soup.find(xmlns='http://www.w3.org/1999/xhtml')
#print(results.prettify())
job_elems = results.find_all('p', xlmins="http://www.w3.org/1999/xhtml")
for job in job_elems:
    paragraph = job.find("p", xlmins='http://www.w3.org/1999/xhtml')
    print(paragraph.text.strip)
No <p> tag contains the attribute xlmins='http://www.w3.org/1999/xhtml'. In fact, xlmins is a typo for xmlns, and even that attribute appears only on the top-level <html> tag. Remove that filter and you'll get all the paragraphs:
job_elems = results.find_all('p')
for job in job_elems:
    print(job.text.strip())
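For what it's worth, here is a minimal end-to-end sketch of the corrected script against the same Gutenberg URL. The xmlns lookup is optional and only narrows the search to the top-level <html> element; calling find_all('p') on the soup directly returns the same paragraphs.
import requests
from bs4 import BeautifulSoup

URL = "http://www.gutenberg.org/files/7142/7142-h/7142-h.htm"
page = requests.get(URL)

# Parse the whole document.
soup = BeautifulSoup(page.content, "html.parser")

# The xmlns attribute only exists on the top-level <html> element,
# so this simply narrows the search to the document root (optional).
root = soup.find(xmlns="http://www.w3.org/1999/xhtml")

# Every paragraph in the document; no attribute filter on <p> is needed.
for p in root.find_all("p"):
    text = p.get_text(strip=True)
    if text:  # skip empty paragraphs
        print(text)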