I wanted to create a dictionary with each holding as the key and its Weight(%) as the value. But when I use soup.find('table', {'id' : 'etf_holding_table'}) to access the table, nothing shows up. I saw some posts saying that the table might be inside a comment and tried to copy a few of the approaches used there, but I couldn't get any of them to work.
I ended up finding someone's answer showing a way to pull ticker information using re, but I can't seem to find any good resources explaining what the code is doing.
Here is the code I copied from that post to get the tickers:
import re

import requests

keys = ['ARKK']
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0"}
# The {} placeholder is filled in with each ticker by url.format(key)
url = 'https://www.zacks.com/funds/etf/{}/holding'

with requests.Session() as req:
    req.headers.update(headers)
    for key in keys:
        r = req.get(url.format(key))
        # Capture whatever sits between 'etf\/' and the next backslash
        goal = re.findall(r'etf\\\/(.*?)\\', r.text)
        print(goal)
OUTPUT
['TSLA', 'TDOC', 'COIN', 'ROKU', 'U', 'ZM', 'SPOT', 'SQ', 'SHOP', 'PATH', 'TWLO', 'EXAS', 'NTLA', 'Z', 'PLTR', 'CRSP', 'TWTR', 'BEAM', 'DKNG', 'NVTA', 'FATE', 'TXG', 'DOCU', 'HOOD', 'PACB', 'PD', 'IRDM', 'TSP', 'DNA', 'TWST', 'VCYT', 'SGFY', 'SKLZ', 'EDIT', 'TWOU', 'IOVA', 'SSYS', 'TRMB', 'BLI', 'MTLS', 'CERS', 'CGEN', 'PRLB', 'NSTG']
By playing around a little with re.findall(), I was sort of able to get the information I wanted, but I'm not sure how to get the numbers I need now.
I'd very much appreciate some good resources on understanding and using re, as I clearly don't understand well enough what I'm doing or how to get the information I need. I would message the poster I got the code from, but it seems you can't message someone on Stack Overflow.
The data is contained inside a JavaScript variable, etf_holdings.formatted_data:
<script>
var etf_holdings = {};
etf_holdings.formatted_data = [ [ "TESLA INC", ...
This is processed by JavaScript in your browser and turned into the table you see.
When you fetch the raw HTML with requests, you're not executing JavaScript, which is why the table is "not there" when you try to find it with BeautifulSoup.
One way of isolating the line containing the data:
>>> import json, requests
>>> r = requests.get('https://www.zacks.com/funds/etf/ARKK/holding', headers={'User-Agent': ''})
>>> line = next(line for line in r.text.splitlines() if line.startswith('etf_holdings.formatted_data '))
If we keep only the [ [ .... ] ] part, i.e. remove everything up to and including the first = and chop off the trailing semicolon, we can load it using the json module.
>>> line[:40]
'etf_holdings.formatted_data = [ [ "TESLA'
>>> line[-10:]
'</a>" ] ];'
Using .find() and slicing is one way to do this:
>>> line = line[line.find('= ') + 2:-1]
>>> data = json.loads(line)
>>> len(data)
45
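The same steps can be wrapped in a small helper. This is only a sketch: the function name extract_formatted_data and the shortened sample_html are made up for illustration, and it assumes the assignment sits on a single line of the page source, as it does in the session above.

```python
import json

# Assumed marker: the start of the line that holds the data in the page source
PREFIX = "etf_holdings.formatted_data"


def extract_formatted_data(html):
    """Return the rows assigned to etf_holdings.formatted_data as a Python list."""
    for line in html.splitlines():
        line = line.strip()
        if line.startswith(PREFIX):
            # Keep everything after the first '=' and drop the trailing ';'
            _, _, payload = line.partition("=")
            return json.loads(payload.strip().rstrip(";"))
    raise ValueError("etf_holdings.formatted_data not found in the page source")


# Shortened, hypothetical stand-in for r.text
sample_html = """<script>
var etf_holdings = {};
etf_holdings.formatted_data = [ [ "TESLA INC", "TSLA", "2,073,604", "11.18" ] ];
</script>"""

rows = extract_formatted_data(sample_html)
print(rows[0][0])  # TESLA INC
```

With the real page you would pass r.text instead of sample_html. Using partition("=") rather than a fixed slice means the extra = signs inside the embedded HTML attributes don't matter, since only the first one is used as the split point.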
Each item is the raw data used to create a row in the table:
>>> data[0]
['TESLA INC',
'<button class="modal_external appear-on-focus"
href="/modals/quick-quote.php" rel="TSLA">TSLA Quick Quote</button>
<a href="//www.zacks.com/funds/etf/TSLA" rel="TSLA"
class=" hoverquote-container-od " show-add-portfolio="true" >
<span class="hoverquote-symbol">TSLA<span class="sr-only"></span></span></a>',
'2,073,604',
'11.18',
'8.48',
'<a class="report_document newwin"
href="/zer/report/TSLA" alt="View Report" title="View Report"></a>']
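From rows shaped like this, the ticker-to-weight dictionary asked for in the question could be built roughly as follows. This is a sketch under assumptions: that index 1 holds the ticker HTML and index 3 holds the Weight(%) (inferred from the sample row above, so check it against the column order on the page), and the name holdings_to_weights is made up for illustration. The regular expression looks for rel=", then captures one or more characters that are not a double quote, which is the ticker.

```python
import re

# Assumed column positions, inferred from the sample row shown above
TICKER_HTML_COL = 1
WEIGHT_COL = 3


def holdings_to_weights(rows):
    """Map each ticker to its weight as a float, skipping rows that don't parse."""
    weights = {}
    for row in rows:
        # rel="TSLA" -> the group captures everything up to the closing quote
        match = re.search(r'rel="([^"]+)"', row[TICKER_HTML_COL])
        if match is None:
            continue
        try:
            weights[match.group(1)] = float(row[WEIGHT_COL].replace(",", ""))
        except ValueError:
            # Weight cell is not numeric; leave this holding out
            continue
    return weights


# One row, abbreviated from the data[0] output above
sample_rows = [[
    "TESLA INC",
    '<a href="//www.zacks.com/funds/etf/TSLA" rel="TSLA">'
    '<span class="hoverquote-symbol">TSLA</span></a>',
    "2,073,604",
    "11.18",
    "8.48",
    "",
]]

print(holdings_to_weights(sample_rows))  # {'TSLA': 11.18}
```

If you would rather key by the holding's name, use row[0] in place of the regex match.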