python web-scraping beautifulsoup scrapy

Beautiful Soup Email Protected

Hello, I am trying to get the email address from this HTML:

 <a href="mailto:CONTACT@domain.COM">CONTACT@domain.COM</a><br><h3> Description </h3>

My code is:

soup = BeautifulSoup(response.text, 'lxml')

def get_textchunk(word1, word2, text):
    # Return the text between the last word1 and the next word2,
    # or '' if either marker is missing.
    # Can refine with more conditions/string-manipulations/regex/etc.
    if not (word1 in text and word2 in text):
        return ''
    return text.split(word1)[-1].split(word2)[0]

mail = get_textchunk('Mail :', 'Description', soup.get_text(' '))
print(mail)

But it prints [Email Protected]

How can I get around this?

Thank you :)

Solution

bs4 solution

You can use bs4 to parse the tree for you:

from bs4 import BeautifulSoup


sample = ' <a href="mailto:CONTACT@domain.COM">CONTACT@domain.COM</a><br><h3> Description </h3>'

soup = BeautifulSoup(sample, "lxml")

print(soup.a.contents)

Out:

['CONTACT@domain.COM']

(Note: .contents returns a list, not a string.)
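If a plain string is more convenient than a list, get_text() on the tag is one option. This is a minimal sketch on the same sample. The variable name link_text is just illustrative, and "html.parser" is used here only so the snippet runs without lxml installed:

```python
from bs4 import BeautifulSoup

sample = ' <a href="mailto:CONTACT@domain.COM">CONTACT@domain.COM</a><br><h3> Description </h3>'

# "html.parser" is an assumption for portability; "lxml" works the same way here
soup = BeautifulSoup(sample, "html.parser")

# get_text() joins the tag's text nodes into one string; strip=True trims whitespace
link_text = soup.a.get_text(strip=True)
print(link_text)  # CONTACT@domain.COM
```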

And if you're looking at a large tree:

for a in soup.find_all("a"):
    print(a.contents)  # same result
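The address also appears in the href attribute, so reading that instead of the link text may be more robust when the visible text differs from the address. The sketch below assumes the links look like the sample above. mailto_addresses is a hypothetical helper name, and "html.parser" is used only to avoid the lxml dependency:

```python
from bs4 import BeautifulSoup

sample = ' <a href="mailto:CONTACT@domain.COM">CONTACT@domain.COM</a><br><h3> Description </h3>'


def mailto_addresses(html):
    """Return the addresses found in mailto: links (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    addresses = []
    # href=True skips <a> tags that have no href attribute at all
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if href.lower().startswith("mailto:"):
            # drop the "mailto:" scheme and any "?subject=..." query part
            addresses.append(href[len("mailto:"):].split("?", 1)[0])
    return addresses


print(mailto_addresses(sample))  # ['CONTACT@domain.COM']
```

Links that do not use the mailto: scheme are skipped, so this only helps when the address is actually present in the href.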

regex solution

Disclaimer: you shouldn't really use regex to parse HTML. That being said:

import re

sample = ' <a href="mailto:CONTACT@domain.COM">CONTACT@domain.COM</a><br><h3> Description </h3>'

pattern = re.compile(r"href\s*?\=\s*?\"mailto\:(?P<email>[^\"]+?)\"")

print(pattern.search(sample)["email"])

Out:

CONTACT@domain.COM

For many matches:

for matched in pattern.finditer(sample):
    print(matched["email"])  # same result