
Cannot scrape protected email from website


I want to scrape emails from this website, but they are protected. They are visible when the page is viewed in a browser, but when I scrape it an encoded placeholder appears in place of each address.

I tried scraping, but got this result:

<a href="/cdn-cgi/l/email-protection#d5a7bba695b9a6b0b2fbb6bab8"><span class="__cf_email__" data-cfemail="c0b2aeb380acb3a5a7eea3afad">[email protected]</span></a>

My code:

from bs4 import BeautifulSoup as bs
import requests
import re


r = requests.get('https://www.accesswire.com/api/newsroom.ashx')
p = re.compile(r" \$\('#newslist'\)\.after\('(.*)\);")
html = p.findall(r.text)[0]
soup = bs(html, 'lxml')
headlines = [item['href'] for item in soup.select('a.headlinelink')]

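# `header` was never defined in the original snippet; a browser-like
# User-Agent is assumed here.
headers = {'User-Agent': 'Mozilla/5.0'}

for head in headlines:
    response2 = requests.get(head, headers=headers)
    soup2 = bs(response2.content, 'html.parser')

    print(soup2.select("a"))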

I want the emails that appear in the body, e.g. "Email: [email protected]" on this page: https://www.accesswire.com/546295/Theramed-Provides-Update-on-New-Sales-Channel-for-Nevada-Facility. The address is protected, though. How can I scrape it as plain text, like a real email address? Thanks


Solution

  • I tried your code first and I too got [email protected].

    Then I realized the website might be filling in that data through JavaScript.

    You can get your work done using Selenium or any lightweight browser.

    I used the PyQt5 library to open the page the way a JavaScript-enabled browser would, then took the rendered source from it and ran the usual BeautifulSoup code on that.

    Prerequisite installation commands (if you are a Windows user):

    To install PyQt5: pip install pyqt5

    The PyQt5 Windows distribution doesn't include PyQtWebEngine, so we need to install it separately:

    pip install PyQtWebEngine
    

    To render JavaScript-based pages I followed sentdex's video here: https://www.youtube.com/watch?v=FSH77vnOGqU

    But it was for PyQt4. To move from PyQt4 to PyQt5, this Stack Overflow answer helped me:

    https://stackoverflow.com/a/44432380/8810517

    My code:

    import requests
    import re
    from bs4 import BeautifulSoup as bs
    
    import sys
    from PyQt5.QtWidgets import QApplication
    from PyQt5.QtCore import QUrl
    from PyQt5.QtWebEngineWidgets import QWebEnginePage
    
    class Client(QWebEnginePage):
        def __init__(self,url):
            self.app = QApplication(sys.argv)
            QWebEnginePage.__init__(self)
    
            self.html=""
            self.loadFinished.connect(self.on_page_load)
    
            self.load(QUrl(url))
            self.app.exec_()
    
        def on_page_load(self):
            # toHtml() is asynchronous: it returns None and hands the
            # rendered HTML to the callback once it is ready.
            self.toHtml(self.Callable)
    
        def Callable(self, html_str):
            # Store the rendered HTML, then stop the event loop so that
            # Client(url) returns.
            self.html = html_str
            self.app.quit()
    
    url="https://www.accesswire.com/546227/InterRent-Announces-Voting-Results-from-the-2019-Annual-and-Special-Meeting"
    
    client_response= Client(url)
    
    soup = bs(client_response.html, 'html.parser')
    # Take the last table on the page and print the text of every link in it.
    table = soup.find_all('table')[-1]
    for a in table.find_all('a'):
        print(a.text)
    
    

    Output:

    [email protected]
    [email protected]
    [email protected]
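
An alternative that avoids rendering the page at all (this is not part of the answer above, just a sketch): the `__cf_email__` span and the `/cdn-cgi/l/email-protection` link in the question are the markers of Cloudflare's email obfuscation. As commonly observed, the `data-cfemail` attribute holds the address as a hex string in which the first byte is an XOR key for the remaining bytes. Assuming that scheme holds for the page, the address can be decoded directly from the HTML that requests returns. The names `decode_cfemail` and `extract_emails` are hypothetical helpers introduced for illustration, and the regex is a shortcut; with BeautifulSoup you would read the `data-cfemail` attribute from each `span.__cf_email__` instead.

```python
import re


def decode_cfemail(encoded: str) -> str:
    """Decode a data-cfemail hex string (assumed scheme: first byte is the XOR key)."""
    data = bytes.fromhex(encoded)
    key = data[0]
    # XOR every remaining byte with the key to recover the original characters.
    return bytes(b ^ key for b in data[1:]).decode("utf-8")


def extract_emails(html: str) -> list:
    """Find every data-cfemail attribute in an HTML string and decode it."""
    return [decode_cfemail(h) for h in re.findall(r'data-cfemail="([0-9a-fA-F]+)"', html)]


# The snippet from the question, used as sample input.
sample = (
    '<a href="/cdn-cgi/l/email-protection#d5a7bba695b9a6b0b2fbb6bab8">'
    '<span class="__cf_email__" data-cfemail="c0b2aeb380acb3a5a7eea3afad">'
    '[email protected]</span></a>'
)

print(extract_emails(sample))
```

In the scraping loop from the question, this would mean passing `response2.text` to `extract_emails` instead of printing the anchors, with no browser or PyQt5 dependency.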