python web-scraping beautifulsoup scrapy

Beautiful Soup Email Protected


Hello, I am trying to get the email address from this HTML:

 <a href="mailto:[email protected]">[email protected]</a><br><h3> Description </h3>

My code is:

soup = BeautifulSoup(response.text, 'lxml')

def get_textchunk(word1, word2, text):
    # return the text between the last word1 and the following word2
    if not (word1 in text and word2 in text):
        return ''
    return text.split(word1)[-1].split(word2)[0]
    ## can refine with more conditions/string-manipulations/regex/etc

mail = get_textchunk('Mail :', 'Description', soup.get_text(' '))
print(mail)
      

But it prints [Email Protected]

How can I get around this obstacle?

Thank you :)


Solution

  • bs4 solution

    You can use bs4 to parse the tree for you:

    from bs4 import BeautifulSoup
    
    
    sample = ' <a href="mailto:[email protected]">[email protected]</a><br><h3> Description </h3>'
    
    soup = BeautifulSoup(sample, "lxml")
    
    print(soup.a.contents)
    

    Out:

    ['[email protected]']
    

    (note: .contents returns a list of the tag's children, here a single string)

    If the tree contains many links, loop over them:

    for a in soup.find_all("a"):
        print(a.contents)  # same result
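
The snippets above read the link text. If the address is needed from the href attribute instead, a minimal sketch could look like the following. The helper name extract_mailto_addresses and the address user@example.com are placeholders invented for illustration (the address in the question is redacted), and "html.parser" is used here only so the example runs without lxml.

```python
from bs4 import BeautifulSoup


def extract_mailto_addresses(html):
    """Return the addresses found in mailto: links, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    addresses = []
    # href=True skips <a> tags that have no href attribute at all
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if href.lower().startswith("mailto:"):
            # drop the scheme and any ?subject=... query part
            addresses.append(href[len("mailto:"):].split("?", 1)[0])
    return addresses


# user@example.com is a placeholder address, not taken from the question
html = '<a href="mailto:user@example.com?subject=Hi">Contact</a><a href="/about">About</a>'
print(extract_mailto_addresses(html))
```

Links whose href does not start with mailto: are ignored, so the function returns an empty list when the page has no email links.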
    

  • regex solution

    Disclaimer: you shouldn't really use regex to parse HTML. That being said:

    import re

    sample = ' <a href="mailto:[email protected]">[email protected]</a><br><h3> Description </h3>'
    
    pattern = re.compile(r"href\s*?\=\s*?\"mailto\:(?P<email>[^\"]+?)\"")
    
    print(pattern.search(sample)["email"])
    

    Out:

    [email protected]
    

    For multiple matches:

    for matched in pattern.finditer(sample):
        print(matched["email"])  # same result
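
The literal text [email protected] looks like the placeholder left by Cloudflare's email obfuscation. The question does not show the raw HTML, so that is an assumption. If it holds, the real address is not in the link text at all. It is stored as a hex string in a data-cfemail attribute, where the first byte is an XOR key for the remaining bytes. A rough sketch of decoding it follows. decode_cfemail and encode_cfemail are hypothetical helper names, and the encoder exists only to build a value to decode.

```python
def decode_cfemail(encoded):
    """Decode a Cloudflare data-cfemail hex string into the plain address."""
    data = bytes.fromhex(encoded)
    key = data[0]  # assumption: the first byte is the XOR key
    return bytes(b ^ key for b in data[1:]).decode("utf-8")


def encode_cfemail(address, key=0x42):
    """Hypothetical inverse of decode_cfemail, used only to make test input."""
    return bytes([key] + [b ^ key for b in address.encode("utf-8")]).hex()


# Round trip with a placeholder address (not taken from the question)
encoded = encode_cfemail("user@example.com")
print(decode_cfemail(encoded))
```

With bs4, the attribute value could be read from each matching tag, for example tag["data-cfemail"] for every tag in soup.select("[data-cfemail]"), and then passed to decode_cfemail.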