python web-scraping financial

Python: Scraping a CSV file request


I am a frequent, longtime lurker here and usually find my questions already answered. However, I have come across a project that is perhaps simple, yet vague enough to escape me. I am fairly new to Python (currently using version 3.6).

I am looking at: https://www.ishares.com/us/products/239726/

From what I can tell, there is some jQuery involved here. Look near the "Holdings" portion of the page: if 'All' is selected instead of 'Top 10', there is an option to get holdings 'as of' a given date.

If a specific historical month is selected, a prompt to download a .csv file appears. What I would like to do is get each .csv file that the drop-down list can produce, going back to Sept 29, 2006. In other words, I want to automatically download the .csv file for each request available through this drop-down list.

To give some (not necessarily relevant) context, I am familiar with pandas and bs4, and perhaps some other less popular libraries. As background, I keep a couple of desk references: 'Beginning Python' by Magnus Lie Hetland and 'Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython' by Wes McKinney.

I would like some small direction on how to approach this issue, in case I am overlooking something. In other words, breadcrumbs are helpful, but I am not asking anyone to do all this work for me. I would like to explore and learn as much as humanly possible.

What libraries/methods should I perhaps use? I understand this is completely open-ended, so I would like to stick to bs4 and pandas as much as possible. Suggestions for other libraries are welcome as well, but those two would be the focus.

Thanks!


Solution

  • I would like some small direction on how to approach this issue

    Using your browser's developer tools, examine the network requests being made. You will see that a request is made when you choose a historical month. If you copy the URL from that request and paste it into your browser, you can check whether the request can be "replayed" to get the payload. I tested it, and it can. What's more, the query parameters are clearly visible and not obfuscated. This means you can programmatically generate the URLs and then fetch them with cURL or wget.

    Do note that I tried to specify a file type of "csv" and got an empty response, but when I requested a file type of "json" I got the data. YMMV. Good luck!
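
Since the question mentions Python, the URL-generation step described above could be sketched roughly as follows. Everything specific here is a placeholder and not taken from the site: `BASE_URL`, the parameter names `fileType` and `asOfDate`, and the `YYYYMMDD` date format are all hypothetical, so copy the real values from the request you see in the network tab. The sketch also assumes the "as of" dates fall on the last weekday of each month (which matches Sept 29, 2006, but ignores holidays); reading the actual dates out of the drop-down would be more reliable.

```python
import calendar
from datetime import date
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical placeholders: replace with the endpoint and parameter
# names copied from the request in your browser's network tab.
BASE_URL = "https://example.com/holdings-endpoint"
FILE_TYPE_PARAM = "fileType"
AS_OF_PARAM = "asOfDate"


def last_weekday_of_month(year, month):
    """Return the last Monday-to-Friday date of the given month."""
    last_day = calendar.monthrange(year, month)[1]
    d = date(year, month, last_day)
    while d.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        d = d.replace(day=d.day - 1)
    return d


def month_end_dates(start, end):
    """Yield the last weekday of every month from start's month to end's month."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield last_weekday_of_month(year, month)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def build_url(as_of, file_type="json"):
    """Build one request URL; the YYYYMMDD date format is an assumption."""
    query = urlencode({
        FILE_TYPE_PARAM: file_type,
        AS_OF_PARAM: as_of.strftime("%Y%m%d"),
    })
    return f"{BASE_URL}?{query}"


def download(url, path):
    """Save the raw response body of url to path (not called here)."""
    with urlopen(url) as response, open(path, "wb") as out:
        out.write(response.read())


# Example: the first four months of the requested range.
urls = [build_url(d) for d in month_end_dates(date(2006, 9, 29), date(2006, 12, 31))]
for url in urls:
    print(url)
```

The default `file_type="json"` reflects the note above that a "csv" request came back empty. Once the real URL is filled in, calling `download(url, path)` in a loop (ideally with a short pause between requests) would save each payload, and pandas can then be used to reshape the saved data.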