I am trying to scrape the dynamic "Holdings" table from https://www.ishares.com/us/products/268752/ishares-global-reit-etf
At first I used Selenium but got an empty DataFrame. The community here then suggested inducing a WebDriverWait so the data is fully loaded before extracting it. That works, but the data I get is truncated from roughly 400 rows down to only 10. How can I get all the data I need? Could anyone help me, please? Thank you.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
# Instantiate options
options = webdriver.ChromeOptions()
options.headless = True
# Instantiate a webdriver
site = 'https://www.ishares.com/us/products/268752/ishares-global-reit-etf'
wd = webdriver.Chrome('chromedriver', options=options)
wd.get(site)
# Induce WebDriver Wait
WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#onetrust-accept-btn-handler"))).click()
wd.execute_script("arguments[0].scrollIntoView();", WebDriverWait(wd, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@data-componentname]/h2[normalize-space()='Holdings']"))))
data = WebDriverWait(wd, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@aria-describedby='allHoldingsTable_info']"))).get_attribute("outerHTML")
data2 = pd.read_html(data)
holding = data2[0]
The code you wrote is fine, but you missed one point. By default the table is paginated and shows only 10 records per page, which is why you retrieved only those records. You have to add one more action step (clicking the 'Show More' button) so that all records are displayed and your DataFrame contains all of them. Here is the refactored code:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
import pandas as pd
# Instantiate options
opt = Options()
opt.add_argument("headless")
opt.add_argument("disable-gpu")
opt.add_argument("window-size=1920,1080")
# Instantiate a webdriver
site = 'https://www.ishares.com/us/products/268752/ishares-global-reit-etf'
wd = webdriver.Chrome('chromedriver', options=opt)
wd.maximize_window()
wd.get(site)
# Induce WebDriver Wait
WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#onetrust-accept-btn-handler"))).click()
wd.execute_script("arguments[0].scrollIntoView();", WebDriverWait(wd, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@data-componentname]/h2[normalize-space()='Holdings']"))))
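# Click 'Show More' so the paginated table renders every holding instead of only the first 10 rows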
WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.XPATH, "(//*[@class='datatables-utilities ui-helper-clearfix']//*[text()='Show More'])[2]"))).click()
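# Grab the fully expanded table's HTML and parse it with pandas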
data = WebDriverWait(wd, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@aria-describedby='allHoldingsTable_info']"))).get_attribute("outerHTML")
data2 = pd.read_html(data)
holding = data2[0]
print(holding)
Output:
Ticker Name ... SEDOL Accrual Date
0 PLD PROLOGIS REIT INC ... B44WZD7 -
1 EQIX EQUINIX REIT INC ... BVLZX12 -
2 PSA PUBLIC STORAGE REIT ... 2852533 -
3 SPG SIMON PROPERTY GROUP REIT INC ... 2812452 -
4 DLR DIGITAL REALTY TRUST REIT INC ... B03GQS4 -
.. ... ... ... ... ...
379 MYR MYR/USD ... - -
380 MYR MYR/USD ... - -
381 MYR MYR/USD ... - -
382 MYR MYR/USD ... - -
383 MYR MYR/USD ... - -
[384 rows x 12 columns]
Process finished with exit code 0
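A follow-up note: if the 'Show More' click ever races the re-render in headless mode, you can add an explicit wait for the table to actually contain more than the 10 paginated rows before grabbing its HTML. A minimal sketch, assuming the table keeps the same aria-describedby attribute used above (the helper name is just illustrative):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By

def get_expanded_table_html(wd, timeout=20, min_rows=11):
    # Assumption: the table still carries aria-describedby='allHoldingsTable_info'
    row_locator = (By.XPATH, "//table[@aria-describedby='allHoldingsTable_info']//tbody/tr")
    # Wait until more rows than one paginated page (10) are present in the DOM
    WebDriverWait(wd, timeout).until(lambda d: len(d.find_elements(*row_locator)) >= min_rows)
    table = wd.find_element(By.XPATH, "//table[@aria-describedby='allHoldingsTable_info']")
    return table.get_attribute("outerHTML")

Call it right after the 'Show More' click and feed the returned HTML to pd.read_html() exactly as in the answer above.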