python, web-scraping, screen-scraping, text-processing

How to filter out certain parts of a page when web scraping


I am new to web scraping, and am trying to extract only the 100 fun facts from the following webpage:

https://holypython.com/100-python-tips-tricks/

However, the following code also picks up filler content such as the navigation menus:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://holypython.com/100-python-tips-tricks/"
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)

An extract of the output is as follows:

100 Python Tips & Tricks | HolyPython.com
Skip to content
Holy Python
Blog
Support
Blog
Support
Machine Learning:
lin-reg
log-reg
knn
naive bayes
trees
random forest
svm
k-means
Machine Learning:
lin-reg
log-reg
knn
naive bayes
trees
random forest
svm
k-means
Learn Python:
Lessons
Exercises
Visualization

How can I remove this excess data, and then split the facts into 100 sections (fact 1, fact 2, and so on)? Thank you in advance.


Solution

  • You were pretty close. Here is a solution that uses the CSS classes of the numbered sections to extract the data you are looking for.

    To explain briefly: if you open the page in your browser's inspector, you will see that each numbered section heading is an h3 element with the classes elementor-heading-title and elementor-size-default. We can use that to select only the data you want.

    The code:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    
    url = "https://holypython.com/100-python-tips-tricks/"
    html = urlopen(url).read()
    soup = BeautifulSoup(html, features="html.parser")
    
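    # Select every h3 whose class attribute is exactly these two classes
    titles = soup.select("h3[class='elementor-heading-title elementor-size-default']")
    
    # Print the text of each matching heading
    for title in titles:
        print(title.text)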
    

    I will leave it as an exercise for you to format the data as you'd like.
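
One possible way to handle that formatting step is sketched below. It assumes the heading text has already been collected into a list of strings; the sample strings, the helper name number_facts, and the idea that a heading may start with a counter such as "1)" or "1." are all assumptions made for illustration, not details taken from the page.

```python
import re

# Hypothetical heading strings standing in for the scraped <h3> text.
# The real page's wording and numbering format may differ.
sample_titles = [
    "1) Printing with sep",
    "2) Swapping variables",
    "",
    "Unpacking with *",
]


def number_facts(titles):
    """Return {1: 'text', 2: 'text', ...} built from heading strings."""
    facts = {}
    for raw in titles:
        text = raw.strip()
        if not text:
            continue  # skip empty headings
        # Drop a leading counter such as "12) " or "12. " if one is present.
        text = re.sub(r"^\d+\s*[.)]\s+", "", text)
        facts[len(facts) + 1] = text
    return facts


facts = number_facts(sample_titles)
for number, fact in facts.items():
    print(f"fact {number}: {fact}")
```

With the titles list from the answer, the same helper could be called as number_facts([t.get_text(strip=True) for t in titles]). Numbering by position rather than trusting the counter on the page keeps the keys contiguous even if a heading is missing or empty.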