Python: Skip url in scraping process conditionally


I'm scraping property ads with BS4 (BeautifulSoup 4), using the following code:

# get_ad_page_urls collects all ad urls displayed on page
def get_ad_page_urls(link):
    container = BS4_main(link) # BS4_main parses the link and returns the "container" object
    return [a.get("href") for a in container.findAll("a", href=re.compile(r"^(/inmueble/)((?!:).)*$"))]

# get_ad_data obtains data from each ad
def get_ad_data(ad_page_url):
    ad_data={}
    response=requests.get(root_url+ad_page_url)
    soup = bs4.BeautifulSoup(response.content, 'lxml')

    <collecting data code here>

    return ad_data

This works fine. With the following multiprocessing code, I scrape all the ads:

def show_ad_data(options):
    pool=Pool(options)
    for link in page_link_list:
        ad_page_urls = get_ad_page_urls(link)
        results=pool.map(get_ad_data, ad_page_urls)    

Now the issue:

Particular ads should be skipped. Those ads display a specific text, by which they are recognisable. I'm new to writing functions with def, and I don't know how to tell the code to skip to the next ad_page_url.

I think the "skipping" code should be placed between soup = bs4.BeautifulSoup(response.content, 'lxml') and <collecting data code here>. Something like,

# "skipping" semi-code
for text in soup:
    if 'specific text' in text:
        continue

but I'm not sure whether continue can be used on iterations inside a function defined with def.

How should I modify the code such that it skips an ad when the specific text is on the page?


Solution

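  • Yes, continue works inside a function as long as it sits inside a loop: when the skip condition in the if statement is met, it moves on to the next iteration. With the if/else layout below, pass has the same effect, because the collecting code only runs in the else branch: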

    def get_ad_data(ad_page_url):
        ad_data={}
        response=requests.get(root_url+ad_page_url)
        soup = bs4.BeautifulSoup(response.content, 'lxml')
    
        for text in soup:
            if 'specific text' in text:
                continue  # or pass
            else:
                <collecting data code here>
    
        return ad_data
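
Since get_ad_data handles one ad per call, another option is to skip at the function level: return None as soon as the marker text is found, then drop the None entries from whatever pool.map returns. Below is a minimal sketch of that pattern, not a drop-in replacement. FAKE_PAGES, SKIP_MARKER and scrape_all are hypothetical names made up for illustration; the dictionary lookup stands in for requests.get plus soup.get_text(), so the example runs without network access.

```python
# Hypothetical stand-in for requests.get(...) followed by soup.get_text():
# maps each ad URL to the visible text of its page.
FAKE_PAGES = {
    "/inmueble/1": "Flat in Madrid, 2 bedrooms",
    "/inmueble/2": "specific text: this ad should be skipped",
    "/inmueble/3": "House in Sevilla, 4 bedrooms",
}

SKIP_MARKER = "specific text"  # assumed marker that identifies ads to skip


def get_ad_data(ad_page_url):
    """Return the ad's data, or None if the ad should be skipped."""
    page_text = FAKE_PAGES[ad_page_url]
    if SKIP_MARKER in page_text:
        return None  # early return plays the role of "continue" here
    # The real data-collecting code would go here.
    return {"url": ad_page_url, "description": page_text}


def scrape_all(ad_page_urls, map_fn=map):
    """Apply get_ad_data to every URL and drop the skipped ads.

    map_fn defaults to the built-in map; a Pool's map method
    can be passed instead to run the calls in parallel.
    """
    results = map_fn(get_ad_data, ad_page_urls)
    return [ad for ad in results if ad is not None]
```

In the original show_ad_data, the same filter could be applied to the list returned by pool.map(get_ad_data, ad_page_urls), so skipped ads never reach the collected results.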