I'm scraping property ads with BS4 (BeautifulSoup 4), using the following code:
import re

# get_ad_page_urls collects all ad urls displayed on page
def get_ad_page_urls(link):
    container = BS4_main(link)  # BS4_main parses the link and returns the "container" object
    return [a.get("href") for a in container.findAll("a", href=re.compile(r"^(/inmueble/)((?!:).)*$"))]
import bs4
import requests

# get_ad_data obtains data from each ad
def get_ad_data(ad_page_url):
    ad_data = {}
    response = requests.get(root_url + ad_page_url)
    soup = bs4.BeautifulSoup(response.content, 'lxml')
    <collecting data code here>
    return ad_data
This works fine. With the following multiprocessing code, I scrape all the ads:
from multiprocessing import Pool

def show_ad_data(options):
    pool = Pool(options)
    for link in page_link_list:
        ad_page_urls = get_ad_page_urls(link)
        results = pool.map(get_ad_data, ad_page_urls)
Now the issue:
Particular ads should be skipped. Those ads display a specific text by which they can be recognised. I'm new to writing def functions, and I don't know how to tell the code to skip to the next ad_page_url.
I think the "skipping" code should be placed between soup = bs4.BeautifulSoup(response.content, 'lxml') and <collecting data code here>. Something like:
# "skipping" semi-code
for text in soup:
    if 'specific text' in text:
        continue
but I'm not sure whether continue can be applied to iterations from inside a def function.
How should I modify the code so that it skips an ad when the specific text is on the page?
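Yes, continue works inside a def function, but only within a loop, where it jumps to the next iteration of that loop. pass is different: it does nothing, so it only appears to skip when the rest of the work sits in an else branch. In this case get_ad_data handles a single ad and pool.map drives the iteration over ad_page_urls, so there is no loop over ads inside the function to continue. To skip an ad, return early (for example None) when the text is found, and drop the None entries from results afterwards:
def get_ad_data(ad_page_url):
    ad_data = {}
    response = requests.get(root_url + ad_page_url)
    soup = bs4.BeautifulSoup(response.content, 'lxml')
    # Skip this ad: return early when the marker text appears anywhere on the page
    if 'specific text' in soup.get_text():
        return None
    <collecting data code here>
    return ad_data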
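As a rough sketch of skipping at the level of whole ads: pool.map calls get_ad_data once per URL, so the function can return None for an ad that contains the text and the caller can drop those entries from the result list. Everything below is illustrative rather than the asker's actual setup. FAKE_PAGES, SKIP_MARKER and scrape are hypothetical names, the dictionary lookup stands in for requests.get plus soup.get_text(), and ThreadPool is swapped in for Pool only so the snippet runs without a `__main__` guard (it offers the same map call).

```python
from multiprocessing.pool import ThreadPool

# Hypothetical stand-in for downloaded pages: URL -> visible page text.
# In the real scraper this text would come from soup.get_text().
FAKE_PAGES = {
    "/inmueble/1": "Flat with two bedrooms",
    "/inmueble/2": "specific text: this ad should be skipped",
    "/inmueble/3": "House with garden",
}

SKIP_MARKER = "specific text"  # assumed marker that identifies ads to skip


def get_ad_data(ad_page_url):
    """Return a dict for a normal ad, or None for an ad that should be skipped."""
    page_text = FAKE_PAGES[ad_page_url]
    if SKIP_MARKER in page_text:
        return None  # the early return plays the role of "continue" for this ad
    return {"url": ad_page_url, "description": page_text}


def scrape(ad_page_urls, workers=2):
    """Map get_ad_data over the URLs and drop the skipped (None) results."""
    with ThreadPool(workers) as pool:
        results = pool.map(get_ad_data, ad_page_urls)  # keeps input order
    return [ad for ad in results if ad is not None]


ads = scrape(sorted(FAKE_PAGES))
print([ad["url"] for ad in ads])  # prints ['/inmueble/1', '/inmueble/3']
```

Returning None and filtering afterwards keeps get_ad_data usable on its own and avoids relying on continue, which has no ad-level loop to act on inside the function.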