Tags: python, web-scraping, beautifulsoup, recaptcha

Scraping data with Python from a reCAPTCHA-protected website


I've been trying to scrape some information for personal use from a website. The script runs without errors, but it can't read the email addresses in the second half of the page. The code I'm using:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://rejestradwokatow.pl/adwokat/abramowicz-joanna-49486')
soup = BeautifulSoup(page.content, "html.parser")

kancelaria = [x.strip() for x in soup.find(
    'div', class_='mb_tab_content special_one').find_all('div')[::2][0].text.split('\n') if x != ''][1:]

with result:

>>> kancelaria
['Kancelaria Adwokacka', 'Chlebnicka 48/51', '80-830 Gdańsk', '', 'Stacjonarny/Fax: 583054010', 'Email: [email\xa0protected]']

Note the last element: 'Email: [email\xa0protected]'. I believe it has something to do with the reCAPTCHA mechanism implemented on the website, but I have no idea how to get around it. Interestingly, the emails in the first half of the page are visible to my program and can be scraped. Any thoughts?

EDIT: I'm referring to the lower part of the page.


Solution

  • Adding another answer to this: the email in that section is filled in by JavaScript, so requests only ever sees the placeholder text. You can confirm this with Selenium, which loads the page in a real browser. The code is below.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    import chromedriver_autoinstaller

    # Download a chromedriver matching the installed Chrome and add it to PATH
    chromedriver_autoinstaller.install()

    # Create the driver and launch the browser
    driver = webdriver.Chrome()
    driver.maximize_window()

    # Go to the page and read the email from the rendered DOM
    # (the find_element_by_* helpers were removed in Selenium 4, so use By.XPATH)
    url = 'https://rejestradwokatow.pl/adwokat/artymiak-grzegorz-46439'
    driver.get(url)
    line = driver.find_element(
        By.XPATH,
        '//div[@class="mb_tab_content special_one"]/div[@class="line_list_K"]'
    ).text
    email = line.split('Email: ')[-1]

    print(email)
    driver.quit()

    # Output:
    # gartymiak@protonmail.com
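
If you would rather not start a browser, it may be possible to decode the address straight from the HTML that requests already returns. The `[email protected]` placeholder looks like Cloudflare's email obfuscation rather than reCAPTCHA, but that is an assumption: I have not checked this site's markup. In that scheme the real address is stored as a hex string in a `data-cfemail` attribute, where the first byte is an XOR key for the remaining bytes. A minimal sketch, in which the helper names and the sample HTML are made up for illustration:

```python
import re


def decode_cfemail(encoded):
    """Decode a hex string of the kind stored in a data-cfemail attribute.

    Assumes the Cloudflare scheme: byte 0 is the XOR key,
    and every following byte is one character XORed with that key.
    """
    data = bytes.fromhex(encoded)
    key = data[0]
    return bytes(b ^ key for b in data[1:]).decode("utf-8")


def extract_emails(html):
    """Return every decoded address found in data-cfemail attributes."""
    return [
        decode_cfemail(value)
        for value in re.findall(r'data-cfemail="([0-9a-fA-F]+)"', html)
    ]


# Hypothetical markup, not copied from the real page
SAMPLE_HTML = (
    '<div>Email: <a class="__cf_email__" '
    'data-cfemail="422302206c21">[email&#160;protected]</a></div>'
)

print(extract_emails(SAMPLE_HTML))
```

To try it on the real page, pass `page.text` from the requests call in the question to `extract_emails`. If the list comes back empty, the site is not using this scheme and the Selenium approach above is the safer route.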