python html beautifulsoup data-extraction

How to remove a <br> tag inside an <a> tag while extracting text and href using Python


Extracting the text and href works fine for every country except South Africa.

The cookie URL below lists all the countries; I only need to extract South Africa.

The difference is that the South Africa link contains a <br> tag. How can I remove or ignore it while extracting?

import re

import requests
from bs4 import BeautifulSoup

cookie_url = "https://www.unilevernotices.com/cookie-notice/notice.html"
response = requests.get(cookie_url)
soup = BeautifulSoup(response.content, 'html.parser')

market = soup.findAll('div', class_=re.compile('richText-content'))

market_linkd = soup.findAll('a', text=re.compile("Spain - Spanish", re.IGNORECASE))
print(" extracted remaining country data ", market_linkd)   # result works fine

market_linkd = soup.findAll('a', text=re.compile("South Africa - English", re.IGNORECASE)) #.replace('<br>','')
print(" South Africa data ", market_linkd)  # result []

for ml in market_linkd:
    print("*********************", ml)
    response = requests.get('https://www.unilevernotices.com'+ml['href'])
    soup = BeautifulSoup(response.content, "html.parser")
    cookie_title = soup.find('h1', class_=re.compile('title-heading'))
    cookie_link = 'https://www.unilevernotices.com'+ml['href']
    print(cookie_link)
    print(cookie_title)  

Output for Spain:
********************* <a href="/spain/spanish/cookie-notice/notice.html" title="Spain - Spanish  ">Spain - Spanish</a>
https://www.unilevernotices.com/spain/spanish/cookie-notice/notice.html
<h1 class="title-heading">Aviso de cookies</h1>

Output for South Africa:
 South Africa data  []

Solution

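  • The title attribute of the link contains trailing whitespace, and the <br> inside the link most likely keeps the text filter from matching. Match on the title attribute instead:

    market_linkd = soup.findAll('a', title=re.compile("South Africa - English  "), href=True)
    print(" South Africa data ", market_linkd)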
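To see why the text filter comes back empty, here is a minimal offline sketch. The HTML snippet is a hypothetical reconstruction of the page (the South Africa href and the position of the <br> are assumptions, not copied from the live site). In Beautiful Soup, the string/text filter is compared against tag.string, which is None once a tag has more than one child, so a link holding both text and a <br> never matches. Matching the title attribute, or matching on get_text() through a filter function, sidesteps that.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question; the real page may differ.
html = """
<div class="richText-content">
  <a href="/spain/spanish/cookie-notice/notice.html" title="Spain - Spanish  ">Spain - Spanish</a>
  <a href="/south-africa/english/cookie-notice/notice.html" title="South Africa - English  ">South Africa - English<br/></a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"South Africa - English", re.IGNORECASE)

# string= is tested against tag.string, which is None here because the
# <a> has two children (the text node and the <br>), so nothing matches.
by_string = soup.find_all("a", string=pattern)

# Option 1: match the title attribute, as the answer suggests.
by_title = soup.find_all("a", title=pattern, href=True)


# Option 2: match the full visible text, ignoring nested tags such as <br>.
def has_matching_text(tag):
    return (
        tag.name == "a"
        and tag.has_attr("href")
        and bool(pattern.search(tag.get_text(" ", strip=True)))
    )


by_text = soup.find_all(has_matching_text)

for link in by_text:
    # get_text() drops the <br>, leaving only the link label.
    print(link["href"], "|", link.get_text(" ", strip=True))
```

Option 2 does not depend on the exact whitespace in the title attribute, which may make it a little more robust if the page markup changes.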