Search code examples
pythonweb-scrapingbeautifulsoupstartswith

Using startswith function to filter a list of urls


I have the following piece of code which extracts all links from a page and puts them in a list (links=[]), which is then passed to the function filter_links() . I wish to filter out any links that are not from the same domain as the starting link, aka the first link in the list. This is what I have:

import requests
from bs4 import BeautifulSoup
import re

start_url = "http://www.enzymebiosystems.org/"
r = requests.get(start_url)
html_content = r.text
soup = BeautifulSoup(html_content, features='lxml')
links = []
for tag in soup.find_all('a', href=True):
    links.append(tag['href'])


def filter_links(links):
    filtered_links = []
    for link in links:
        if link.startswith(links[0]):
            filtered_links.append(link)
        return filtered_links


print(filter_links(links))

I have used the built-in startswith function, but its filtering out everything except the starting url. Eventually I want to pass several different start urls through this program, so I need a generic way of filtering urls that are within the same domain as the starting url.I think I could use regex but this function should work too?


Solution

  • Try this :

    import requests
    from bs4 import BeautifulSoup
    import re
    import tldextract
    
    start_url = "http://www.enzymebiosystems.org/"
    r = requests.get(start_url)
    html_content = r.text
    soup = BeautifulSoup(html_content, features='lxml')
    links = []
    for tag in soup.find_all('a', href=True):
        links.append(tag['href'])
    
    def filter_links(links):
        ext = tldextract.extract(start_url)
        domain = ext.domain
        filtered_links = []
        for link in links:
            if domain in link:
                filtered_links.append(link)
        return filtered_links
    
    
    print(filter_links(links))
    

    Note :

    1. You need to get that return statement out of the for loop. It is just returning the result after iterating over just one element and thus only the first item inside a list is only getting returned.
    2. Use tldextract module to better extract the domain name from the urls. If you want to explicitly check whether the links starts with links[0], it's up to you.

    Output :

    ['http://enzymebiosystems.org', 'http://enzymebiosystems.org/', 'http://enzymebiosystems.org/leadership/about/', 'http://enzymebiosystems.org/leadership/directors-advisors/', 'http://enzymebiosystems.org/leadership/mission-values/', 'http://enzymebiosystems.org/leadership/marketing-strategy/', 'http://enzymebiosystems.org/leadership/business-strategy/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/technology/manufacturer/', 'http://enzymebiosystems.org/recent-developments/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org/investors-media/press-releases/', 'http://enzymebiosystems.org/contact-us/', 'http://enzymebiosystems.org/leadership/about', 'http://enzymebiosystems.org/leadership/about', 'http://enzymebiosystems.org/leadership/marketing-strategy', 'http://enzymebiosystems.org/leadership/marketing-strategy', 'http://enzymebiosystems.org/contact-us', 'http://enzymebiosystems.org/contact-us', 'http://enzymebiosystems.org/view-sec-filings/', 'http://enzymebiosystems.org/view-sec-filings/', 'http://enzymebiosystems.org/unregistered-sale-of-equity-securities/', 'http://enzymebiosystems.org/unregistered-sale-of-equity-securities/', 'http://enzymebiosystems.org/enzymebiosystems-files-sec-form-8-k-change-in-directors-or-principal-officers/', 'http://enzymebiosystems.org/enzymebiosystems-files-sec-form-8-k-change-in-directors-or-principal-officers/', 'http://enzymebiosystems.org/form-10-q-for-enzymebiosystems/', 'http://enzymebiosystems.org/form-10-q-for-enzymebiosystems/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org', 'http://enzymebiosystems.org/leadership/about/', 'http://enzymebiosystems.org/leadership/directors-advisors/', 'http://enzymebiosystems.org/leadership/mission-values/', 'http://enzymebiosystems.org/leadership/marketing-strategy/', 'http://enzymebiosystems.org/leadership/business-strategy/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/technology/manufacturer/', 'http://enzymebiosystems.org/investors-media/news/', 'http://enzymebiosystems.org/investors-media/investor-relations/', 'http://enzymebiosystems.org/investors-media/press-releases/', 'http://enzymebiosystems.org/investors-media/stock-information/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org/contact-us']