python, python-3.x, web-scraping, pagination, python-requests-html

Recursive Web Scraping Pagination


I'm trying to scrape some real estate articles from the following website:

Link

I managed to get the links I need, but I am struggling with pagination on the web page. I'm trying to scrape every link under each category ('building relationships', 'building your team', 'capital rising', etc.). Some of these category pages are paginated and some are not. I tried the following code, but it only gives me the links from the second page.

from requests_html import HTMLSession


def tag_words_links(url):
    global _session
    _request = _session.get(url)
    tags = _request.html.find('a.tag-cloud-link')
    links = []
    for link in tags:
        links.append({
             'Tags': link.find('a', first=True).text,
             'Links': link.find('a', first=True).attrs['href']
         })

    return links

def parse_tag_links(link):
    global _session
    _request = _session.get(link)
    articles = []
    try:
        next_page = _request.html.find('link[rel="next"]', first=True).attrs['href']
        _request = _session.get(next_page)
        article_links = _request.html.find('h3 a')
        for article in article_links:
            articles.append(article.find('a', first=True).attrs['href'])

    except:
        _request = _session.get(link)
        article_links = _request.html.find('h3 a')
        for article in article_links:
            articles.append(article.find('a', first=True).attrs['href'])

    return articles


if __name__ == '__main__':
    _session = HTMLSession()
    url = 'https://lifebridgecapital.com/podcast/'
    links = tag_words_links(url)
    print(parse_tag_links('https://lifebridgecapital.com/tag/multifamily/'))

Solution

  • To print the title of every article under each tag, across every page of that tag, you can use this example:

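    import requests
    from bs4 import BeautifulSoup


    url = "https://lifebridgecapital.com/podcast/"

    # Collect the URL of every tag listed in the tag cloud.
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    tag_links = [a["href"] for a in soup.select(".tagcloud a")]

    for link in tag_links:
        # Keep following the "next" link until the tag has no more pages.
        while True:
            print(link)
            print("-" * 80)

            soup = BeautifulSoup(requests.get(link).content, "html.parser")

            # Article titles are the links inside the <h3> headings.
            for title in soup.select("h3 a"):
                print(title.text)

            print()

            # Tags without pagination (and last pages) have no a.next element.
            next_link = soup.select_one("a.next")
            if not next_link:
                break

            link = next_link["href"]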
    

    Prints:

    ...
    
    https://lifebridgecapital.com/tag/multifamily/
    --------------------------------------------------------------------------------
    WS890: Successful Asset Classes In The Current Market with Jerome Maldonado
    WS889: How To Avoid A $1,000,000 Mistake with Hugh Odom
    WS888: Value-Based On BRRRR VS Cap Rate with John Stoeber
    WS887: Slow And Steady Still Wins The Race with Nicole Pendergrass
    WS287: Increase Your NOI by Converting Units to Short Term Rentals with Michael Sjogren
    WS271: Investment Strategies To Survive An Economic Downturn with Vinney Chopra
    WS270: Owning a Construction Company Creates More Value with Abraham Ng’hwani
    WS269: The Impacts of Your First Deal with Kyle Mitchell
    WS260: Structuring Deals To Get The Best Return On Investment with Jeff Greenberg
    WS259: Capital Raising For Newbies with Bryan Taylor
    
    https://lifebridgecapital.com/tag/multifamily/page/2/
    --------------------------------------------------------------------------------
    WS257: Why Ground Up Development is the Best Investment with Sam Bates
    WS256: Mobile Home Park Investing: The Real Deal with Jefferson Lilly
    WS249: Managing Real Estate Paperwork Successfully with Krista Testani
    WS245: Multifamily Syndication with Venkat Avasarala
    WS244: Passive Investing In Real Estate with Kay Kay Singh
    WS243: Getting Started In Real Estate Brokerage with Tyler Chesser
    WS213: Data Analytics In Real Estate with Raj Tekchandani
    WS202: Ben Leybovich and Sam Grooms on The Advantages Of A Partnership In Real Estate Business
    WS199: Financial Freedom Through Real Estate Investing with Rodney Miller
    WS197: Loan Qualifications: How The Whole Process Works with Vinney Chopra
    
    https://lifebridgecapital.com/tag/multifamily/page/3/
    --------------------------------------------------------------------------------
    WS172: Real Estate Syndication with Kyle Jones
    
    ...
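
The question asks for the article links rather than printed titles, so the same loop can be sketched as a function that returns them. This is only a minimal sketch: the names `iter_paginated`, `fetch_article_page` and `collect_links` are hypothetical, and it assumes the `h3 a` and `a.next` selectors from the example above still match the site's markup. The page fetcher is passed in as a parameter so the pagination logic can be checked without touching the network.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# A page is (article links found on it, URL of the next page or None).
Page = Tuple[List[str], Optional[str]]


def iter_paginated(start_url: str, fetch_page: Callable[[str], Page]) -> Iterator[str]:
    """Yield the items from start_url and from every page that follows it."""
    seen = set()  # guards against a "next" link that points back to a visited page
    url: Optional[str] = start_url
    while url and url not in seen:
        seen.add(url)
        items, url = fetch_page(url)
        yield from items


def fetch_article_page(url: str) -> Page:
    """Download one tag page (assumes the selectors used in the example above)."""
    # Imported here so iter_paginated can be exercised without these packages.
    import requests
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    links = [a["href"] for a in soup.select("h3 a[href]")]
    next_link = soup.select_one("a.next")
    return links, (next_link.get("href") if next_link else None)


def collect_links(
    tag_urls: List[str],
    fetch_page: Callable[[str], Page] = fetch_article_page,
) -> Dict[str, List[str]]:
    """Map each tag URL to the article links gathered across all of its pages."""
    return {tag: list(iter_paginated(tag, fetch_page)) for tag in tag_urls}
```

Calling `collect_links(["https://lifebridgecapital.com/tag/multifamily/"])` should then return one list containing the links from every page of that tag. Tags without pagination are handled by the same loop, because a missing `a.next` element simply ends it after the first page.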