
How to scrape table into dataframe with selenium / requests / beautifulsoup?


My objective: on https://data.eastmoney.com/executive/000001.html, when you scroll down you will find a big table (the 000001 table), and I want to turn it into a DataFrame in Python. Is BeautifulSoup enough to do this, or do I have to use Selenium?

Some people on Stack Overflow said that BeautifulSoup cannot scrape table data like this from the Internet, so I tried Selenium. Here is the code:

from selenium import webdriver
import pandas as pd

driver = webdriver.Chrome()
driver.get('https://data.eastmoney.com/executive/000001.html')
table_element = driver.find_element_by_xpath("//table")
item_element = table_element.find_element_by_xpath("//tr[2]/td[3]")
item_text = item_element.text
df = pd.DataFrame([item_text], columns=["Item"])
print(df)
driver.quit()

and here is the outcome:

Traceback (most recent call last):
  File "selenium/webdriver/common/service.py", line 76, in start
    stdin=PIPE)
  File "subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "subprocess.py", line 1344, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'chromedriver': 'chromedriver'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/jqcore/jqboson/jqboson/core/entry.py", line 379, in _run
    engine.start()
  File "/tmp/jqcore/jqboson/jqboson/core/engine.py", line 231, in start
    self._dispatcher.start()
  File "/tmp/jqcore/jqboson/jqboson/core/dispatcher.py", line 280, in start
    self._run_loop()
  File "/tmp/jqcore/jqboson/jqboson/core/dispatcher.py", line 240, in _run_loop
    self._loop.run()
  File "/tmp/jqcore/jqboson/jqboson/core/loop/loop.py", line 107, in run
    self._handle_queue()
  File "/tmp/jqcore/jqboson/jqboson/core/loop/loop.py", line 153, in _handle_queue
    message.callback(**message.callback_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_subscriber.py", line 228, in broadcast
    consumer.send(market_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 59, in consumer_gen
    msg_callback()
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 52, in msg_callback
    callback(market_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 122, in wrapper
    result = callback(*args, **kwargs)
  File "/tmp/jqcore/jqboson/jqboson/core/strategy.py", line 474, in _wrapper
    self._context.current_dt
  File "/tmp/strategy/user_code.py", line 85, in handle_data
    driver = webdriver.Chrome()
  File "selenium/webdriver/chrome/webdriver.py", line 73, in __init__
    self.service.start()
  File "selenium/webdriver/common/service.py", line 83, in start
    os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home

Basically it says "the chromedriver executable needs to be in PATH". The problem is that I am using an online backtesting platform called JoinQuant (www.joinquant.com), and all the Python files such as "selenium/webdriver/common/service.py" are not local - they are not on my computer's disk drive. So Selenium is complicated here. Do I have to use Selenium to scrape data like this from the Internet and turn it into a DataFrame in Python? Or can I use something else, like BeautifulSoup? BeautifulSoup at least does not have the "driver needs to be in PATH" problem.

For BeautifulSoup, here's what I tried:

# Web crawler
# Send an HTTP request to get the page content
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://data.eastmoney.com/executive/000001.html'
response = requests.get(url)
html_content = response.text

# Check if the request is successful
if response.status_code == 200:
    # Use BeautifulSoup to Analyze Internet information and get the table
    soup = BeautifulSoup(html_content, 'html.parser')
    table = soup.find_all('table')
    # Acquire the rows and columns of the table
    rows = table.find_all('tr')
    data = []
    for row in rows:
        cols = row.find_all('td')
        row_data = []
        for col in cols:
            row_data.append(col.text.strip())
        data.append(row_data)
else:
    print("Failed to Retrieve the Webpage.")

# Set up DataFrame
dataframe = pd.DataFrame(data)
# Print DataFrame
print(dataframe)

and here's the output:

Traceback (most recent call last):
  File "/tmp/jqcore/jqboson/jqboson/core/entry.py", line 379, in _run
    engine.start()
  File "/tmp/jqcore/jqboson/jqboson/core/engine.py", line 231, in start
    self._dispatcher.start()
  File "/tmp/jqcore/jqboson/jqboson/core/dispatcher.py", line 280, in start
    self._run_loop()
  File "/tmp/jqcore/jqboson/jqboson/core/dispatcher.py", line 240, in _run_loop
    self._loop.run()
  File "/tmp/jqcore/jqboson/jqboson/core/loop/loop.py", line 107, in run
    self._handle_queue()
  File "/tmp/jqcore/jqboson/jqboson/core/loop/loop.py", line 153, in _handle_queue
    message.callback(**message.callback_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_subscriber.py", line 228, in broadcast
    consumer.send(market_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 59, in consumer_gen
    msg_callback()
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 52, in msg_callback
    callback(market_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 122, in wrapper
    result = callback(*args, **kwargs)
  File "/tmp/jqcore/jqboson/jqboson/core/strategy.py", line 474, in _wrapper
    self._context.current_dt
  File "/tmp/strategy/user_code.py", line 114, in handle_data
    rows = table.find_all('tr')
  File "bs4/element.py", line 1884, in __getattr__
    "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

But if you change

table = soup.find_all('table')

into

table = soup.find('table')

Here's the outcome:

Traceback (most recent call last):
  File "/tmp/jqcore/jqboson/jqboson/core/entry.py", line 379, in _run
    engine.start()
  File "/tmp/jqcore/jqboson/jqboson/core/engine.py", line 231, in start
    self._dispatcher.start()
  File "/tmp/jqcore/jqboson/jqboson/core/dispatcher.py", line 280, in start
    self._run_loop()
  File "/tmp/jqcore/jqboson/jqboson/core/dispatcher.py", line 240, in _run_loop
    self._loop.run()
  File "/tmp/jqcore/jqboson/jqboson/core/loop/loop.py", line 107, in run
    self._handle_queue()
  File "/tmp/jqcore/jqboson/jqboson/core/loop/loop.py", line 153, in _handle_queue
    message.callback(**message.callback_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_subscriber.py", line 228, in broadcast
    consumer.send(market_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 59, in consumer_gen
    msg_callback()
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 52, in msg_callback
    callback(market_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 122, in wrapper
    result = callback(*args, **kwargs)
  File "/tmp/jqcore/jqboson/jqboson/core/strategy.py", line 474, in _wrapper
    self._context.current_dt
  File "/tmp/strategy/user_code.py", line 114, in handle_data
    rows = table.find_all('tr')
AttributeError: 'NoneType' object has no attribute 'find_all'

So, to sum up: which one should I use, Selenium or BeautifulSoup? Or even something else? And how should I tackle this issue?


Solution

  • There is no need to use Selenium or BeautifulSoup here; in my opinion the easiest and most direct way is to use the API through which the data is pulled.

    How do you know whether the content is loaded/rendered dynamically, as in this case?

    First indicator: open the website in a browser as a human and notice that a loading animation/delay appears for that area. Second indicator: the content is not included in the static response to the request. You can then use the browser's developer tools (Network tab, filtered to XHR requests) to see which data is being loaded from which resources. -> http://developer.chrome.com/docs/devtools/network

    If there is an API, use it; otherwise go with Selenium.
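    As a minimal sketch of the second indicator, you can check whether the raw (non-JS-rendered) HTML already contains a table. The helper below is hypothetical and uses a crude string test rather than a full parser:

    ```python
    def table_in_static_html(html: str) -> bool:
        # Crude, dependency-free check: does the raw HTML returned by the
        # server already contain an opening <table tag?
        return '<table' in html.lower()

    # Usage sketch (network call; the result depends on the live site):
    # import requests
    # html = requests.get('https://data.eastmoney.com/executive/000001.html').text
    # print(table_in_static_html(html))  # False would mean the table is loaded via XHR
    ```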

    Url:

    https://datacenter-web.eastmoney.com/api/data/v1/get
    

    Parameters:

    reportName: RPT_EXECUTIVE_HOLD_DETAILS
    columns: ALL
    filter: (SECURITY_CODE="000001")
    pageNumber: 1
    pageSize: 100 #increase this to avoid paging
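    These parameters can also be passed to requests as a dict, which lets requests percent-encode the filter expression (the quotes and equals sign) for you. A sketch, assuming the same endpoint and report name as above:

    ```python
    import requests

    # Build the request with a params dict; requests percent-encodes
    # the filter value ('=' -> %3D, '"' -> %22) automatically.
    base = 'https://datacenter-web.eastmoney.com/api/data/v1/get'
    params = {
        'reportName': 'RPT_EXECUTIVE_HOLD_DETAILS',
        'columns': 'ALL',
        'filter': '(SECURITY_CODE="000001")',
        'pageNumber': 1,
        'pageSize': 100,
    }
    # .prepare() builds the final URL without sending anything yet
    url = requests.Request('GET', base, params=params).prepare().url
    print(url)
    ```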
    
    Example:

    import requests
    import pandas as pd
    
    pd.DataFrame(
        requests.get('https://datacenter-web.eastmoney.com/api/data/v1/get?reportName=RPT_EXECUTIVE_HOLD_DETAILS&columns=ALL&filter=(SECURITY_CODE%3D%22000001%22)&pageNumber=1&pageSize=100')\
            .json().get('result').get('data')
    )
    
    SECURITY_CODE DERIVE_SECURITY_CODE SECURITY_NAME CHANGE_DATE PERSON_NAME CHANGE_SHARES AVERAGE_PRICE CHANGE_AMOUNT CHANGE_REASON CHANGE_RATIO CHANGE_AFTER_HOLDNUM HOLD_TYPE DSE_PERSON_NAME POSITION_NAME PERSON_DSE_RELATION ORG_CODE GGEID BEGIN_HOLD_NUM END_HOLD_NUM
    0 000001 000001.SZ 平安银行 2021-09-06 00:00:00 谢永林 26700 18.01 480867 竞价交易 0.0001 26700 A股 谢永林 董事 本人 10004085 173000004782302008 nan 26700
    1 000001 000001.SZ 平安银行 2021-09-06 00:00:00 项有志 4000 18.46 73840 竞价交易 0.0001 26000 A股 项有志 董事,副行长,首席财务官 本人 10004085 173000004782302010 nan 26000
    ...
    32 000001 000001.SZ 平安银行 2009-08-19 00:00:00 刘巧莉 46200 21.04 972048 竞价交易 0.0015 nan A股 马黎民 监事 配偶 10004085 140000000281406241 nan nan
    33 000001 000001.SZ 平安银行 2007-07-09 00:00:00 王魁芝 1600 27.9 44640 二级市场买卖 0.0001 7581 A股 王魁芝 监事 本人 10004085 173000001049726006 5981 7581
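    If the dataset ever outgrows a single page, you can walk pageNumber instead of relying on a large pageSize. The fetch_all_pages helper below is hypothetical; it assumes the result object carries 'pages' and 'data' fields, matching the JSON shape used above:

    ```python
    import pandas as pd

    def fetch_all_pages(get_page):
        """Collect every page from get_page(page_number), a callable that
        returns the API's 'result' dict ({'pages': int, 'data': [...]})."""
        frames, page = [], 1
        while True:
            result = get_page(page) or {}
            data = result.get('data') or []
            if not data:
                break
            frames.append(pd.DataFrame(data))
            if page >= result.get('pages', 1):
                break
            page += 1
        return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

    # Usage sketch against the real endpoint (untested wiring):
    # import requests
    # url = 'https://datacenter-web.eastmoney.com/api/data/v1/get'
    # df = fetch_all_pages(lambda p: requests.get(url, params={
    #     'reportName': 'RPT_EXECUTIVE_HOLD_DETAILS', 'columns': 'ALL',
    #     'filter': '(SECURITY_CODE="000001")',
    #     'pageNumber': p, 'pageSize': 50}).json().get('result'))
    ```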