I'm not really sure how complex this question is, but I figured I'd give it a shot.
How can I create a web crawler/scraper (not sure which I'd need) to get a CSV of all CEO pay-ratio data? https://www.bloomberg.com/graphics/ceo-pay-ratio/
I'd like this information for further analysis; however, I am not sure how to retrieve it from a dynamic webpage. I have built web scrapers in the past, but only for simple websites and functions.
If you could point me to a good resource or post the code below, I will forever be in your debt.
Thanks in advance!
Note that scraping this website may be flagged as a violation of its terms of service; this particular website uses multiple techniques to block scraping by script engines.
If you inspect the webpage, you may observe that clicking the next button triggers no XHR request. So you can deduce that the content is loaded only once.
If you sort the network requests by size, you will find that all the data is loaded from a single JSON file.
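The sort-by-size step can also be done offline. This is only a sketch, assuming you exported the network log from the browser's developer tools as a HAR file; the function name largest_responses and the file name page.har are made up for illustration. HAR is JSON, and each entry records the request URL and the response body size.

```python
import json


def largest_responses(har, top=5):
    """Return (size_in_bytes, url) pairs for the biggest responses in a HAR dict."""
    sized = []
    for entry in har["log"]["entries"]:
        # HAR keeps the response body length under response.content.size
        size = entry["response"]["content"].get("size", 0)
        sized.append((size, entry["request"]["url"]))
    # Biggest responses first
    sized.sort(key=lambda pair: pair[0], reverse=True)
    return sized[:top]
```

To use it, load the exported file with json.load(open("page.har", encoding="utf-8")) and pass the result to largest_responses; the JSON data file should show up near the top of the list.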
Using Python (note that you need to open the page in your browser just before running the script):
import requests

url = "https://www.bloomberg.com/graphics/ceo-pay-ratio/live-data/ceo-pay-ratio/live/data.json"
data = requests.get(url).json()
for each in data['companies']:
    try:
        print("Company", each['c'], "=> CEO pay ratio", each['cpr'])
    except KeyError:
        # This record has no 'cpr' field
        print("Company", each['c'], "=> no CEO pay ratio !")
Which gives you:
Company Aflac Inc => CEO pay ratio 300
Company American Campus Communities Inc => CEO pay ratio 226
Company Aetna Inc => CEO pay ratio 235
Company Ameren Corp => CEO pay ratio 66
Company AmerisourceBergen Corp => CEO pay ratio 0
Company Advance Auto Parts Inc => CEO pay ratio 329
Company American International Group Inc => CEO pay ratio 697
Company Arthur J Gallagher & Co => CEO pay ratio 126
Company Arch Capital Group Ltd => CEO pay ratio 104
Company ACADIA Pharmaceuticals Inc => CEO pay ratio 54
[...]
It may be better to open the JSON in a web browser and save it locally rather than requesting it from the website with a script.
After saving the JSON locally as data.json, you can read it with:
import json

# Load the JSON file saved from the browser
with open("data.json", "r") as f:
    data = json.load(f)
for each in data['companies']:
    try:
        print("Company", each['c'], "=> CEO pay ratio", each['cpr'])
    except KeyError:
        # This record has no 'cpr' field
        print("Company", each['c'], "=> no CEO pay ratio !")
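Since the question asks for a CSV, here is a minimal sketch of how the loaded records could be written out. It only assumes that data['companies'] is a list of dicts, as in the loops above; the function name companies_to_csv and the output file name are made up, and apart from 'c' and 'cpr' I have not checked what the other keys mean, so every key that appears simply becomes a column.

```python
import csv


def companies_to_csv(companies, csv_path):
    """Write a list of company dicts to csv_path, one row per company."""
    # Collect every key in first-seen order, since some records lack fields such as 'cpr'
    fieldnames = []
    for company in companies:
        for key in company:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        # restval fills in an empty cell when a record is missing a column
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(companies)
    return fieldnames
```

After loading data.json as shown above, calling companies_to_csv(data['companies'], "ceo_pay_ratio.csv") should produce one row per company, with an empty cell wherever a company has no 'cpr' value.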