Tags: python, web-scraping, beautifulsoup, urllib

Trouble extracting list of quotes from webpage using bs4 and Python


I want to use bs4 to navigate to a webpage and extract all the quotes on the page into a list.

I also want to extract the total number of pages for that author (shown in an element at the bottom of the page).

This is the code I am currently using:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup

listOfQuotes = []
website = "https://www.brainyquote.com/authors/nassim-nicholas-taleb-quotes"
req = Request(website, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
s = soup(webpage,"html.parser")
div_container = s.findAll("div", {"id":"quotesList"})

I am having trouble searching the div_container object for the quotes.


Solution

  • The easiest approach is to find the quotes by their title attribute, which every quote link shares:

    import requests
    from bs4 import BeautifulSoup
    
    url = "https://www.brainyquote.com/authors/nassim-nicholas-taleb-quotes"
    r = requests.get(url)
    # Name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(r.text, "html.parser")
    
    # Collect every "a" element whose title is "view quote"
    all_a_quotes = soup.find_all("a", attrs={"title": "view quote"})
    for a in all_a_quotes:
        # do something...
        print(a.text)
    

    This will output (60 in total):

    I'm in favour of religion as a tamer of arrogance. For a Greek Orthodox, the idea of God as creator outside the human is not God in God's terms. My God isn't the God of George Bush.
    You are rich if and only if money you refuse tastes better than money you accept.
    If you take risks and face your fate with dignity, there is nothing you can do that makes you small; if you don't take risks, there is nothing you can do that makes you grand, nothing.
    Steve Jobs, Bill Gates and Mark Zuckerberg didn't finish college. Too much emphasis is placed on formal education - I told my children not to worry about their grades but to enjoy learning.
    [...]
    Debt is a mistake between lender and borrower, and both should suffer.
    Capitalism is about adventurers who get harmed by their mistakes, not people who harm others with their mistakes.
    The next time you experience a blackout, take some solace by looking at the sky. You will not recognize it.
    
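The question asks for the quotes in a list rather than printed one by one. A minimal sketch of that, assuming the same title="view quote" links as above: the helper name extract_quotes and the SAMPLE_HTML snippet are hypothetical stand-ins so the example runs without a network request, and the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<div id="quotesList">
  <a title="view quote" href="/quotes/1">First quote.</a>
  <a title="view author" href="/authors/x">Some Author</a>
  <a title="view quote" href="/quotes/2">  Second quote. </a>
</div>
"""


def extract_quotes(html):
    """Return the text of every <a title="view quote"> element as a list."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a", attrs={"title": "view quote"})
    # get_text(strip=True) trims the whitespace around each quote
    return [a.get_text(strip=True) for a in links]


list_of_quotes = extract_quotes(SAMPLE_HTML)
print(list_of_quotes)
```

With the live page, passing r.text instead of SAMPLE_HTML should give the same kind of list, provided the site still marks quote links with that title.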

    For pagination, check whether the "ul" pagination element at the bottom of the page exists. If it does not, the author has only one page. If it does, count its "li" items and subtract 2 for the previous and next links:

    pagination = soup.select('ul[class*="pagination"]')
    if not pagination:
        # no pagination element means everything fits on a single page
        pages = 1
    else:
        # subtract two: the "previous" and "next" items are not pages
        pages = len(pagination[0].find_all("li")) - 2
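The pagination check can be wrapped in a small function so it is easy to try on a snippet of HTML. This is only a sketch under the same assumptions as the code above: the name count_pages and the sample markup are hypothetical, and a missing pagination list is treated as a single page. If the real pager abbreviates long page ranges (for example with an ellipsis item), counting "li" items would not give the true total.

```python
from bs4 import BeautifulSoup


def count_pages(html):
    """Estimate the page count from a pagination <ul>, if one exists."""
    soup = BeautifulSoup(html, "html.parser")
    pagination = soup.select('ul[class*="pagination"]')
    if not pagination:
        # No pagination list: assume everything is on one page
        return 1
    items = pagination[0].find_all("li")
    # Drop the "previous" and "next" items; never report fewer than 1 page
    return max(len(items) - 2, 1)


# Hypothetical markup used only to exercise the function
SAMPLE_PAGER = (
    '<ul class="pagination">'
    "<li>Prev</li><li>1</li><li>2</li><li>3</li><li>Next</li>"
    "</ul>"
)

print(count_pages(SAMPLE_PAGER))
print(count_pages("<p>no pager here</p>"))
```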