Tags: python, python-3.x, web-scraping, return

Unable to return all the results at once


I've written a script in Python to fetch some links from a webpage. There are two functions within my script: the first collects links to the local businesses from a webpage, and the second traverses those links and collects URLs to the various events.

When I try the script found here, I get the desired results.

How can I return all the results while complying with the design below?

The following script returns the results of individual links, whereas I wish to return all the results at once while keeping the design as it is (the logic may vary).

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

linklist = []

def collect_links(link):
    # First function: collect links to the local businesses
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    items = [urljoin(url, item.get("href")) for item in soup.select(".business-listings-category-list .field-content a[hreflang]")]
    return items

def fetch_info(ilink):
    # Second function: collect URLs to the events from each business link
    res = requests.get(ilink)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select(".business-teaser-title a[title]"):
        linklist.append(urljoin(url, item.get("href")))
    return linklist

if __name__ == '__main__':
    url = "https://www.parentmap.com/atlas"
    for itemlink in collect_links(url):
        print(fetch_info(itemlink))

Solution

  • First of all, I removed the global linklist: it is returned from the function anyway, and keeping it global makes results from earlier calls pile up in later ones. I also join relative URLs against the page that was actually fetched rather than the global url, so the helpers no longer depend on a variable defined in __main__. Next I added a function to "assemble" the links the way you wanted, using a set to prevent duplicate links (an order-preserving alternative is sketched after the code).

    #!/usr/bin/env python3
    
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin
    
    def collect_links(link):
        # Gather the business-category links from the landing page
        res = requests.get(link)
        soup = BeautifulSoup(res.text, "lxml")
        items = [urljoin(link, item.get("href")) for item in soup.select(".business-listings-category-list .field-content a[hreflang]")]
        return items
    
    def fetch_info(ilink):
        # Collect links from one page; linklist is local now,
        # so results no longer accumulate across calls
        linklist = []
        res = requests.get(ilink)
        soup = BeautifulSoup(res.text, "lxml")
        for item in soup.select(".business-teaser-title a[title]"):
            linklist.append(urljoin(ilink, item.get("href")))
        return linklist
    
    def fetch_all_links(url):
        # Traverse every collected link and merge the results,
        # using a set to drop duplicates
        links = set()
        for itemlink in collect_links(url):
            links.update(fetch_info(itemlink))
        return list(links)
    
    if __name__ == '__main__':
        url = "https://www.parentmap.com/atlas"
        print(fetch_all_links(url))
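
  • One caveat about the set: it discards the order in which the links were discovered. If that order matters, a minimal variation on the same design (my own sketch, not part of the original answer) is to collect into a list and deduplicate with dict.fromkeys(), which preserves insertion order on Python 3.7+. This is a drop-in replacement for fetch_all_links above:

    def fetch_all_links(url):
        # Merge results from every collected link, dropping duplicates
        # while keeping the order in which links were first seen
        links = []
        for itemlink in collect_links(url):
            links.extend(fetch_info(itemlink))
        return list(dict.fromkeys(links))

    Either version issues one HTTP request per link; for a larger crawl, reusing a single requests.Session() across the calls would cut connection overhead.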