Extracting the text and href works fine for every country except South Africa.
The cookie URL below lists all the countries; I only need to extract South Africa.
The difference is that the South Africa link contains a <br> tag. How can I remove or ignore it while extracting?
import re
import requests
from bs4 import BeautifulSoup
cookie_url = "https://www.unilevernotices.com/cookie-notice/notice.html"
response = requests.get(cookie_url)
soup = BeautifulSoup(response.content, 'html.parser')
market = soup.findAll('div', class_=re.compile('richText-content'))
market_linkd = soup.findAll('a', text=re.compile("Spain - Spanish", re.IGNORECASE))
print(" extracted remaining country data ", market_linkd)  # works fine
market_linkd = soup.findAll('a', text=re.compile("South Africa - English", re.IGNORECASE))  # .replace('<br>','')
print(" South Africa data ", market_linkd)  # result: []
for ml in market_linkd:
    print("*********************", ml)
    response = requests.get('https://www.unilevernotices.com' + ml['href'])
    soup = BeautifulSoup(response.content, "html.parser")
    cookie_title = soup.find('h1', class_=re.compile('title-heading'))
    cookie_link = 'https://www.unilevernotices.com' + ml['href']
    print(cookie_link)
    print(cookie_title)
Output for Spain:
********************* <a href="/spain/spanish/cookie-notice/notice.html" title="Spain - Spanish ">Spain - Spanish</a>
https://www.unilevernotices.com/spain/spanish/cookie-notice/notice.html
<h1 class="title-heading">Aviso de cookies</h1>
Output for South Africa:
South Africa data []
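The link text is split by the <br> tag, so the text= filter cannot match it (a tag with more than one child has no single .string). The title attribute still holds the full name, with a trailing space, so match on title instead: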
market_linkd = soup.findAll('a', title=re.compile("South Africa - English "), href=True)
print(" South Africa data ", market_linkd)
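To see why the title match works where the text match does not, here is a small offline sketch. The HTML snippet is an assumption made up for illustration: the real page may place the <br> differently, and the South Africa href is a guess modelled on the Spain one. The helper `link_text` is also hypothetical. It shows a second option as well: joining the link's text fragments with `get_text` and comparing the normalised result.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the Spain link from the question;
# the position of <br> and the South Africa href are assumptions.
html = """
<div class="richText-content">
  <a href="/spain/spanish/cookie-notice/notice.html" title="Spain - Spanish ">Spain - Spanish</a>
  <a href="/south-africa/english/cookie-notice/notice.html" title="South Africa - English ">South Africa -<br/> English</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# string= is the current name for text=. It is compared against tag.string,
# which is None when a tag has more than one child, so the <br> link is skipped.
by_string = soup.find_all("a", string=re.compile("South Africa", re.IGNORECASE))

# Matching on the title attribute does not look at the tag's children at all.
by_title = soup.find_all("a", title=re.compile(r"South Africa - English\s*"), href=True)


def link_text(tag):
    """Join all text fragments of a tag and collapse runs of whitespace."""
    return " ".join(tag.get_text(" ").split())


# Alternative: filter on the normalised visible text instead of the title.
by_text = [
    a for a in soup.find_all("a", href=True)
    if re.search("South Africa - English", link_text(a), re.IGNORECASE)
]

print(by_string)
print(by_title[0]["href"])
print(link_text(by_text[0]))
```

The title match is the shorter fix; the `get_text` variant may be useful if some links lack a title attribute.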