My objective is this: on https://data.eastmoney.com/executive/000001.html, when you scroll down you will find a big table, and I want to turn it into a DataFrame in Python. Is BeautifulSoup enough to do so, or do I have to use Selenium?
Some people on Stack Overflow said that BeautifulSoup cannot crawl table data like this from the Internet, so I tried Selenium first. Here is the code:
from selenium import webdriver
import pandas as pd

driver = webdriver.Chrome()
driver.get('https://data.eastmoney.com/executive/000001.html')
# Find the first table on the page, then grab a single cell from it
table_element = driver.find_element_by_xpath("//table")
item_element = table_element.find_element_by_xpath(".//tr[2]/td[3]")  # relative XPath, so the search stays inside the table
item_text = item_element.text
df = pd.DataFrame([item_text], columns=["Item"])
print(df)
driver.quit()
and here is the outcome:
Traceback (most recent call last):
File "selenium/webdriver/common/service.py", line 76, in start
stdin=PIPE)
File "subprocess.py", line 709, in __init__
restore_signals, start_new_session)
File "subprocess.py", line 1344, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'chromedriver': 'chromedriver'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/tmp/jqcore/jqboson/jqboson/core/entry.py", line 379, in _run
engine.start()
File "/tmp/jqcore/jqboson/jqboson/core/engine.py", line 231, in start
self._dispatcher.start()
File "/tmp/jqcore/jqboson/jqboson/core/dispatcher.py", line 280, in start
self._run_loop()
File "/tmp/jqcore/jqboson/jqboson/core/dispatcher.py", line 240, in _run_loop
self._loop.run()
File "/tmp/jqcore/jqboson/jqboson/core/loop/loop.py", line 107, in run
self._handle_queue()
File "/tmp/jqcore/jqboson/jqboson/core/loop/loop.py", line 153, in _handle_queue
message.callback(**message.callback_data)
File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_subscriber.py", line 228, in broadcast
consumer.send(market_data)
File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 59, in consumer_gen
msg_callback()
File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 52, in msg_callback
callback(market_data)
File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 122, in wrapper
result = callback(*args, **kwargs)
File "/tmp/jqcore/jqboson/jqboson/core/strategy.py", line 474, in _wrapper
self._context.current_dt
File "/tmp/strategy/user_code.py", line 85, in handle_data
driver = webdriver.Chrome()
File "selenium/webdriver/chrome/webdriver.py", line 73, in __init__
self.service.start()
File "selenium/webdriver/common/service.py", line 83, in start
os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home
Basically it says "'chromedriver' executable needs to be in PATH". The problem is that I am using an online backtesting platform called JoinQuant (www.joinquant.com), so all the Python files such as "selenium/webdriver/common/service.py" are not local - they are not on my computer's disk drive. That makes Selenium complicated to set up there. Do I have to use Selenium to crawl data like this from the Internet and turn it into a DataFrame in Python, or can I use something else like BeautifulSoup? At least BeautifulSoup does not have the "driver needs to be in PATH" problem.
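For reference, on a local machine this particular error only means Python cannot find the chromedriver binary, and you can point Selenium at it explicitly. A minimal sketch, assuming a downloaded chromedriver at the path shown (the path is an assumption; adjust it to your machine):

from selenium import webdriver

# Selenium 3 style; in Selenium 4 you would wrap the path in a Service object:
#   from selenium.webdriver.chrome.service import Service
#   driver = webdriver.Chrome(service=Service('/usr/local/bin/chromedriver'))
driver = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver')
driver.get('https://data.eastmoney.com/executive/000001.html')
print(driver.title)
driver.quit()

None of this helps on JoinQuant, though, since there is no browser or driver binary on that platform to point at.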
For BeautifulSoup, here's what I tried:
# Web crawler
# Send an HTTP request to get the page content
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://data.eastmoney.com/executive/000001.html'
response = requests.get(url)
html_content = response.text
# Check if the request was successful
if response.status_code == 200:
    # Use BeautifulSoup to parse the page and get the table
    soup = BeautifulSoup(html_content, 'html.parser')
    table = soup.find_all('table')
    # Acquire the rows and columns of the table
    rows = table.find_all('tr')
    data = []
    for row in rows:
        cols = row.find_all('td')
        row_data = []
        for col in cols:
            row_data.append(col.text.strip())
        data.append(row_data)
else:
    print("Failed to retrieve the webpage.")
# Set up the DataFrame
dataframe = pd.DataFrame(data)
# Print the DataFrame
print(dataframe)
and here's the output:
Traceback (most recent call last):
File "/tmp/jqcore/jqboson/jqboson/core/entry.py", line 379, in _run
engine.start()
File "/tmp/jqcore/jqboson/jqboson/core/engine.py", line 231, in start
self._dispatcher.start()
File "/tmp/jqcore/jqboson/jqboson/core/dispatcher.py", line 280, in start
self._run_loop()
File "/tmp/jqcore/jqboson/jqboson/core/dispatcher.py", line 240, in _run_loop
self._loop.run()
File "/tmp/jqcore/jqboson/jqboson/core/loop/loop.py", line 107, in run
self._handle_queue()
File "/tmp/jqcore/jqboson/jqboson/core/loop/loop.py", line 153, in _handle_queue
message.callback(**message.callback_data)
File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_subscriber.py", line 228, in broadcast
consumer.send(market_data)
File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 59, in consumer_gen
msg_callback()
File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 52, in msg_callback
callback(market_data)
File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 122, in wrapper
result = callback(*args, **kwargs)
File "/tmp/jqcore/jqboson/jqboson/core/strategy.py", line 474, in _wrapper
self._context.current_dt
File "/tmp/strategy/user_code.py", line 114, in handle_data
rows = table.find_all('tr')
File "bs4/element.py", line 1884, in __getattr__
"ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
But if you change
table = soup.find_all('table')
into
table = soup.find('table')
here's the outcome:
Traceback (most recent call last):
File "/tmp/jqcore/jqboson/jqboson/core/entry.py", line 379, in _run
engine.start()
File "/tmp/jqcore/jqboson/jqboson/core/engine.py", line 231, in start
self._dispatcher.start()
File "/tmp/jqcore/jqboson/jqboson/core/dispatcher.py", line 280, in start
self._run_loop()
File "/tmp/jqcore/jqboson/jqboson/core/dispatcher.py", line 240, in _run_loop
self._loop.run()
File "/tmp/jqcore/jqboson/jqboson/core/loop/loop.py", line 107, in run
self._handle_queue()
File "/tmp/jqcore/jqboson/jqboson/core/loop/loop.py", line 153, in _handle_queue
message.callback(**message.callback_data)
File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_subscriber.py", line 228, in broadcast
consumer.send(market_data)
File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 59, in consumer_gen
msg_callback()
File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 52, in msg_callback
callback(market_data)
File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 122, in wrapper
result = callback(*args, **kwargs)
File "/tmp/jqcore/jqboson/jqboson/core/strategy.py", line 474, in _wrapper
self._context.current_dt
File "/tmp/strategy/user_code.py", line 114, in handle_data
rows = table.find_all('tr')
AttributeError: 'NoneType' object has no attribute 'find_all'
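For reference, a guarded version of the parse (iterating the ResultSet instead of calling find_all on it) avoids both AttributeErrors, but it still yields an empty DataFrame, which already hints that the table is not in the static HTML at all:

import requests
import pandas as pd
from bs4 import BeautifulSoup

response = requests.get('https://data.eastmoney.com/executive/000001.html')
data = []
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    # find_all returns a ResultSet (a list); iterate it rather than
    # calling find_all on the list itself
    for tbl in soup.find_all('table'):
        for row in tbl.find_all('tr'):
            data.append([col.text.strip() for col in row.find_all('td')])
print(pd.DataFrame(data))  # comes out empty here: the table is rendered client-side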
So to sum it all up: which one should I use, Selenium or BeautifulSoup, or even something else? And how should I tackle this issue?
No need to use selenium or beautifulsoup; in my opinion the easiest and most direct way is to use the API through which the data is pulled.
How do you know that the content is loaded / rendered dynamically in this case?
First indicator: call up the website as a human in the browser and notice that a loading animation or delay appears for that area. Second indicator: the content is not included in the static response to the request. You can then use the browser's developer tools to look at the XHR requests in the Network tab to see which data is being loaded from which resources. -> http://developer.chrome.com/docs/devtools/network
If there is an API, use it; otherwise go with selenium.
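You can check the second indicator directly in code: fetch the page with requests and test whether text that is visible in the rendered table shows up in the raw HTML. A minimal sketch; the search string is just an example taken from the rendered table:

import requests

html = requests.get('https://data.eastmoney.com/executive/000001.html').text
# '谢永林' is a name visible in the rendered table; if it does not occur in the
# raw HTML, the table is loaded dynamically (via XHR), not server-rendered.
print('谢永林' in html)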
URL:
https://datacenter-web.eastmoney.com/api/data/v1/get
Parameters:
reportName: RPT_EXECUTIVE_HOLD_DETAILS
columns: ALL
filter: (SECURITY_CODE="000001")
pageNumber: 1
pageSize: 100  # increase this to avoid paging
import requests
import pandas as pd

pd.DataFrame(
    requests.get('https://datacenter-web.eastmoney.com/api/data/v1/get?reportName=RPT_EXECUTIVE_HOLD_DETAILS&columns=ALL&filter=(SECURITY_CODE%3D%22000001%22)&pageNumber=1&pageSize=30')
        .json().get('result').get('data')
)
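Equivalently, you can let requests build the query string from the parameter list above, which avoids hand-encoding the filter; a sketch of the same call:

import requests
import pandas as pd

params = {
    'reportName': 'RPT_EXECUTIVE_HOLD_DETAILS',
    'columns': 'ALL',
    'filter': '(SECURITY_CODE="000001")',  # requests URL-encodes this for us
    'pageNumber': 1,
    'pageSize': 100,  # large enough that all rows fit on one page
}
resp = requests.get('https://datacenter-web.eastmoney.com/api/data/v1/get', params=params)
df = pd.DataFrame(resp.json().get('result').get('data'))
print(df)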
Output:

|    | SECURITY_CODE | DERIVE_SECURITY_CODE | SECURITY_NAME | CHANGE_DATE | PERSON_NAME | CHANGE_SHARES | AVERAGE_PRICE | CHANGE_AMOUNT | CHANGE_REASON | CHANGE_RATIO | CHANGE_AFTER_HOLDNUM | HOLD_TYPE | DSE_PERSON_NAME | POSITION_NAME | PERSON_DSE_RELATION | ORG_CODE | GGEID | BEGIN_HOLD_NUM | END_HOLD_NUM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 000001 | 000001.SZ | 平安银行 | 2021-09-06 00:00:00 | 谢永林 | 26700 | 18.01 | 480867 | 竞价交易 | 0.0001 | 26700 | A股 | 谢永林 | 董事 | 本人 | 10004085 | 173000004782302008 | nan | 26700 |
| 1 | 000001 | 000001.SZ | 平安银行 | 2021-09-06 00:00:00 | 项有志 | 4000 | 18.46 | 73840 | 竞价交易 | 0.0001 | 26000 | A股 | 项有志 | 董事,副行长,首席财务官 | 本人 | 10004085 | 173000004782302010 | nan | 26000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32 | 000001 | 000001.SZ | 平安银行 | 2009-08-19 00:00:00 | 刘巧莉 | 46200 | 21.04 | 972048 | 竞价交易 | 0.0015 | nan | A股 | 马黎民 | 监事 | 配偶 | 10004085 | 140000000281406241 | nan | nan |
| 33 | 000001 | 000001.SZ | 平安银行 | 2007-07-09 00:00:00 | 王魁芝 | 1600 | 27.9 | 44640 | 二级市场买卖 | 0.0001 | 7581 | A股 | 王魁芝 | 监事 | 本人 | 10004085 | 173000001049726006 | 5981 | 7581 |
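One note on the pageSize remark above: instead of guessing a value large enough to hold everything, you can loop over pageNumber until the API stops returning rows. A sketch, under the assumption that an empty or missing data list marks the end of the result set:

import requests
import pandas as pd

url = 'https://datacenter-web.eastmoney.com/api/data/v1/get'
rows, page = [], 1
while True:
    resp = requests.get(url, params={
        'reportName': 'RPT_EXECUTIVE_HOLD_DETAILS',
        'columns': 'ALL',
        'filter': '(SECURITY_CODE="000001")',
        'pageNumber': page,
        'pageSize': 50,
    })
    data = (resp.json().get('result') or {}).get('data') or []
    if not data:
        break  # assumption: no rows means we are past the last page
    rows.extend(data)
    page += 1

df = pd.DataFrame(rows)
print(len(df), 'rows')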