Hello, I am trying to extract the email address from this HTML:
<a href="mailto:[email protected]">[email protected]</a><br><h3> Description </h3>
My code is:
from bs4 import BeautifulSoup
# `response` is assumed to come from an earlier HTTP request
soup = BeautifulSoup(response.text, 'lxml')
def get_textchunk(word1, word2, text):
    # return the text between the last word1 and the following word2
    if not (word1 in text and word2 in text):
        return ''
    ## can refine with more conditions/string-manipulations/regex/etc
    return text.split(word1)[-1].split(word2)[0]
mail = get_textchunk('Mail :', 'Description', soup.get_text(' '))
print(mail)
But it prints [Email Protected].
How can I get around this obstacle?
Thank you :)
You can use bs4 to parse the tree for you:
from bs4 import BeautifulSoup
sample = ' <a href="mailto:[email protected]">[email protected]</a><br><h3> Description </h3>'
soup = BeautifulSoup(sample, "lxml")
print(soup.a.contents)
Out:
['[email protected]']
(note: .contents returns a list object, not a string)
And if you're looking at a large tree:
for a in soup.find_all("a"):
    print(a.contents)  # same result
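If you want the address as a plain string rather than a list, reading the href attribute is another option. The following is only a minimal sketch: the sample HTML and the address in it are placeholders I made up, and I use the standard-library "html.parser" so it runs without lxml (swap in "lxml" if you have it).

```python
from bs4 import BeautifulSoup

# Hypothetical sample; the address is a placeholder, not taken from the question.
sample = '<a href="mailto:user@example.com">Contact</a><br><h3> Description </h3>'

# "html.parser" ships with Python; "lxml" behaves the same here if installed.
soup = BeautifulSoup(sample, "html.parser")

emails = []
for a in soup.find_all("a", href=True):
    href = a["href"]
    if href.lower().startswith("mailto:"):
        # Drop the "mailto:" scheme and any "?subject=..." query part.
        emails.append(href[len("mailto:"):].split("?", 1)[0])

print(emails)
```

This only helps when the real address is actually present in the href of the page you downloaded.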
Disclaimer: you shouldn't really use regex to parse HTML. That being said:
import re
sample = ' <a href="mailto:[email protected]">[email protected]</a><br><h3> Description </h3>'
pattern = re.compile(r"href\s*?\=\s*?\"mailto\:(?P<email>[^\"]+?)\"")
print(pattern.search(sample)["email"])
Out:
[email protected]
For many matches:
for matched in pattern.finditer(sample):
    print(matched["email"])  # same result
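One possible reason for still getting [Email Protected], and this is an assumption because the question does not show the raw HTML: the site may be using Cloudflare's email obfuscation. In that scheme the tag carries a data-cfemail attribute holding a hex string, where the first byte is an XOR key and the remaining bytes are the address XORed with that key. A minimal sketch of the decoding step follows; decode_cfemail is a name I chose, and the hex value is hand-built for illustration, not taken from any real page.

```python
def decode_cfemail(encoded: str) -> str:
    """Decode a hex string in the data-cfemail format (assumed scheme)."""
    data = bytes.fromhex(encoded)
    key = data[0]  # the first byte is the XOR key
    # XOR every remaining byte with the key to recover the address
    return bytes(b ^ key for b in data[1:]).decode("utf-8")


# Hand-built illustration: "a@b.c" XORed with key 0x2a
decoded = decode_cfemail("2a4b6a480449")
print(decoded)
```

With bs4 you could then look for tags that have the attribute, for example soup.find_all(attrs={"data-cfemail": True}), and pass tag["data-cfemail"] to the function. If the page does not contain that attribute, this approach does not apply.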