I am trying to scrape data from this link https://www.morningstar.com/funds/xnas/gibix/portfolio -- basically all the data I can get, but particularly the Fixed Income Style Table and the Exposure, Bond Breakdown table.
Here is my code:
import requests
from selenium import webdriver
import pandas as pd

link = 'https://api-global.morningstar.com/sal-service/v1/fund/portfolio/holding/v2/F00000MUR2/data'

headers = {
    'apikey': 'lstzFDEOhfFNMLikKa0am9mgEKLBl49T',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36'
}

payload = {
    'premiumNum': '1000',
    'freeNum': '1000',
    'languageId': 'en',
    'locale': 'en',
    'clientId': 'MDC',
    #'benchmarkId': 'mstarorcat',
    'benchmarkId': 'category',
    'component': 'sal-components-mip-holdings',
    'version': '3.59.1'
}

with requests.Session() as s:
    s.headers.update(headers)
    resp = s.get(link, params=payload)
    container = resp.json()
The code above scrapes the holdings data at the bottom of the page. But I am having trouble figuring out what the 'component' field in my payload should be for the other tables. I have even tried 'sal-components-fixed-income-exposure-analysis', but to no avail.
What you are doing is not web scraping but an API request. There is probably a way to get the data you want through the API, but you may have to discover it from their docs: https://developer.morningstar.com/developer-resources/api-visualization-library/about
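If you want to keep probing the API in the meantime, one option is to loop over candidate component values and see what each returns. Here's a minimal sketch reusing your headers and payload; the extra candidate names are guesses on my part, not documented IDs, and the endpoint path itself may also need to change for other tables (the browser's network tab is the place to confirm both):

import requests

link = 'https://api-global.morningstar.com/sal-service/v1/fund/portfolio/holding/v2/F00000MUR2/data'
headers = {
    'apikey': 'lstzFDEOhfFNMLikKa0am9mgEKLBl49T',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36'
}
base_payload = {
    'premiumNum': '1000', 'freeNum': '1000', 'languageId': 'en',
    'locale': 'en', 'clientId': 'MDC', 'benchmarkId': 'category',
    'version': '3.59.1'
}
# Candidate 'component' values -- the last two are hypothetical guesses
candidates = [
    'sal-components-mip-holdings',
    'sal-components-fixed-income-exposure-analysis',
    'sal-components-mip-fixed-income-style',
]
with requests.Session() as s:
    s.headers.update(headers)
    for component in candidates:
        resp = s.get(link, params={**base_payload, 'component': component})
        # A 200 with a JSON content type suggests the component is valid here
        print(component, resp.status_code, resp.headers.get('content-type'))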
But I can provide you with a code snippet for actually scraping the data from this page:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from time import sleep
import pandas as pd

url = 'https://www.morningstar.com/funds/xnas/gibix/portfolio'

options = webdriver.ChromeOptions()
options.add_argument('--headless=new')

with webdriver.Chrome(service=Service(ChromeDriverManager().install()),
                      options=options) as driver:
    driver.get(url)
    sleep(10)  # give the page's JavaScript time to render the tables
    html = driver.page_source

tables = pd.read_html(html)  # this will require the lxml module
"tables" here is a list of dataframes from every table found in the page when fully loaded.
To install lxml module just pip install lxml
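Since the order of the tables is not guaranteed, a quick way to locate the two you want is to scan the list by column headers. A small sketch assuming the tables list from the snippet above; the 'Bond' match string is a guess, so print the headers first and adjust it:

# Print the index and column headers of every scraped table
for i, df in enumerate(tables):
    print(i, list(df.columns))

# Keep the tables whose headers mention 'Bond' -- a hypothetical match
# string; adjust it after inspecting the printed headers above
bond_tables = [df for df in tables
               if any('Bond' in str(col) for col in df.columns)]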
PS: I tried getting the HTML with a plain requests response, but it returns a different page; it looks like you have to open the page in a real browser and wait until it is fully loaded to get the correct source HTML.
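One last note: if the fixed sleep(10) turns out to be flaky, an explicit wait for a <table> element to appear is a sturdier variant of the same idea. A sketch assuming the same url and options as above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

with webdriver.Chrome(service=Service(ChromeDriverManager().install()),
                      options=options) as driver:
    driver.get(url)
    # Wait up to 30 seconds for at least one <table> to be present,
    # instead of sleeping for a fixed amount of time
    WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.TAG_NAME, 'table'))
    )
    tables = pd.read_html(driver.page_source)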