python, python-3.x, function, web-scraping, return

How to handle InvalidSchema exception


I've written a script in Python using two functions. The first function, get_links(), fetches some links from a webpage and passes them to another function, get_info(). At that point get_info() should produce the different shop names from the different links, but instead it throws an error: raise InvalidSchema("No connection adapters were found for '%s'" % url).

Here is my attempt:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def get_links(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text,"lxml")
    elem = soup.select(".info h2 a[data-analytics]")
    return get_info(elem)

def get_info(url):
    response = requests.get(url)
    print(response.url)
    soup = BeautifulSoup(response.text,"lxml")
    return soup.select_one("#main-header .sales-info h1").get_text(strip=True)

if __name__ == '__main__':
    link = 'https://www.yellowpages.com/search?search_terms=%20Injury%20Law%20Attorneys&geo_location_terms=California&page=2'    
    for review in get_links(link):
        print(urljoin(link,review.get("href")))

The key thing I'm trying to learn here is the real-life usage of return get_info(elem).

I created another thread concerning this return get_info(elem). Link to that thread.

When I do it like the following, I get the expected results:

def get_links(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text,"lxml")
    elem = soup.select(".info h2 a[data-analytics]")
    return elem

def get_info(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text,"lxml")
    return soup.select_one("#main-header .sales-info h1").get_text(strip=True)

if __name__ == '__main__':
    link = 'https://www.yellowpages.com/search?search_terms=%20Injury%20Law%20Attorneys&geo_location_terms=California&page=2'    
    for review in get_links(link):
        print(get_info(urljoin(link,review.get("href"))))

My question: how can I get the results the way I tried in my first script, making use of return get_info(elem)?


Solution

  • Inspect what each function returns. In this case, the first script can never work: get_info() takes a URL, but you are calling get_info(elem), where elem is a list of the elements selected by soup.select(). Passing that list where a URL string is expected is what triggers the InvalidSchema error.

    You can see this in your second script, which returns the list so that you can iterate over it and extract each href. So if you want to call get_info() inside get_links() in your first script, apply it to the individual items rather than the list; a list comprehension works well here.

    import requests
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup
    
    def get_links(url):
        response = requests.get(url)
        soup = BeautifulSoup(response.text,"lxml")
        elem = soup.select(".info h2 a[data-analytics]")
        # apply get_info() to each item, not to the list itself
        return [get_info(urljoin(url, e.get("href"))) for e in elem]
    
    def get_info(url):
        response = requests.get(url)
        soup = BeautifulSoup(response.text,"lxml")
        return soup.select_one("#main-header .sales-info h1").get_text(strip=True)
    
    link = 'https://www.yellowpages.com/search?search_terms=%20Injury%20Law%20Attorneys&geo_location_terms=California&page=2'
    
    for review in get_links(link): 
        print(review) 
    

    Now the first function still returns a list, but with get_info() applied to each of its elements, which is how it should work: get_info() accepts a URL, not a list. And since urljoin() and get_info() are already applied inside get_links(), you can simply loop over the result to print the shop names.
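    The same pattern can be shown without any network calls. This is a minimal sketch, using made-up names (process_one() and collect() are hypothetical stand-ins for get_info() and get_links(), not part of the original scripts), of why returning f(whole_list) fails when f expects a single item, and how a list comprehension fixes it:

    def process_one(item):
        # Stand-in for get_info(): expects ONE string, just as
        # get_info() expects one URL. Raising here mirrors requests
        # rejecting a non-URL argument.
        if not isinstance(item, str):
            raise TypeError("expected a single string, got %r" % type(item))
        return item.upper()

    def collect(items):
        # Wrong: process_one(items) passes the whole list and raises.
        # Right: apply the per-item function to each element instead.
        return [process_one(i) for i in items]

    print(collect(["alpha", "beta"]))  # ['ALPHA', 'BETA']

    The structure is identical to the fixed get_links(): the outer function still returns a list, but the per-item function is applied inside the comprehension rather than to the list as a whole.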