Tags: python, web-scraping, beautifulsoup, python-re

I can't find a table using bs4, and I found an alternative using `re`, but I'm not sure how to get the information I need


I want to create a dictionary with each holding as the key and its Weight(%) as the value. But when I use soup.find('table', {'id' : 'etf_holding_table'}) to access the table, nothing shows up. I saw some posts saying that the table might be inside a comment and tried a few of the approaches used there, but none of them worked for me.

I ended up finding someone's answer showing a way to pull the ticker information using re, but I can't find any good resources explaining what that code is doing.
Here is the code I copied from that post to get the tickers.

import re

import requests

keys = ['ARKK']
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0"}
# The {} placeholder is filled in with each ETF symbol by url.format(key) below
url = 'https://www.zacks.com/funds/etf/{}/holding'

with requests.Session() as req:
    req.headers.update(headers)
    for key in keys:
        r = req.get(url.format(key))
        goal = re.findall(r'etf\\\/(.*?)\\', r.text)
        print(goal)
OUTPUT
['TSLA', 'TDOC', 'COIN', 'ROKU', 'U', 'ZM', 'SPOT', 'SQ', 'SHOP', 'PATH', 'TWLO', 'EXAS', 'NTLA', 'Z', 'PLTR', 'CRSP', 'TWTR', 'BEAM', 'DKNG', 'NVTA', 'FATE', 'TXG', 'DOCU', 'HOOD', 'PACB', 'PD', 'IRDM', 'TSP', 'DNA', 'TWST', 'VCYT', 'SGFY', 'SKLZ', 'EDIT', 'TWOU', 'IOVA', 'SSYS', 'TRMB', 'BLI', 'MTLS', 'CERS', 'CGEN', 'PRLB', 'NSTG']

By playing around a little with re.findall() I was sort of able to get the information I wanted (highlighted in yellow), but I'm not sure how to get the numbers I need now.

I'd very much appreciate some good resources on understanding and using re, as I clearly don't understand what I'm doing or how to get the information I need. I would message the poster I got the code from, but it seems you can't message someone on Stack Overflow.


Solution

  • The data is contained inside a JavaScript variable, etf_holdings.formatted_data:

    <script>
    var etf_holdings            = {};
    etf_holdings.formatted_data = [ [ "TESLA INC", ...
    

    This is processed by JavaScript in your browser and turned into the table you see.

    When you fetch the raw HTML with requests, no JavaScript is executed, which is why the table is "not there" when you try to find it with BeautifulSoup.
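This is also why the re.findall() call from the question finds the tickers: it searches the raw page text, where the row markup sits inside the script as an escaped string. Here is a rough sketch of what that pattern does. The sample string is made up for illustration (an assumption about how the escaped markup looks, not copied from the page).

```python
import re

# Hypothetical sample of the raw page text (an assumption, not copied from
# the site): inside the script the HTML is escaped, so "/" shows up as "\/"
# and '"' shows up as '\"'.
sample = r'<a href=\"\/\/www.zacks.com\/funds\/etf\/TSLA\" rel=\"TSLA\">'

# The pattern from the question, piece by piece:
#   etf      the literal text "etf"
#   \\       one literal backslash
#   \/       one literal forward slash
#   (.*?)    capture as few characters as possible ...
#   \\       ... up to the next literal backslash
pattern = r'etf\\\/(.*?)\\'

tickers = re.findall(pattern, sample)
print(tickers)  # ['TSLA']
```

Because the pattern has a single capture group, re.findall() returns only the captured part of each match, which is why the output in the question is a plain list of symbols.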

    One way of isolating the line containing the data:

    >>> import json, requests
    >>> r = requests.get('https://www.zacks.com/funds/etf/ARKK/holding', headers={'User-Agent': ''})
    >>> line = next(line for line in r.text.splitlines() if line.startswith('etf_holdings.formatted_data '))
    

    If we keep only the [ [ .... ] ] part, i.e. remove everything up to the first = and chop off the trailing semicolon, we can load it using the json module.

    >>> line[:40]
    'etf_holdings.formatted_data = [ [ "TESLA'
    >>> line[-10:]
    '</a>" ] ];'
    

    Using .find() and slicing is one way to do this:

    >>> line = line[line.find('= ') + 2:-1]
    >>> data = json.loads(line)
    >>> len(data)
    45
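The same steps can be wrapped up in a small helper. This is only a sketch: extract_formatted_data and sample_page are names invented here, and the sample text is a shortened, partly made-up stand-in for r.text so the example runs without a network request.

```python
import json


def extract_formatted_data(page_text):
    """Return the parsed etf_holdings.formatted_data array from raw page text."""
    prefix = 'etf_holdings.formatted_data'
    for line in page_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(prefix):
            # Keep everything after the first "=" and drop the trailing ";".
            _, _, payload = stripped.partition('=')
            return json.loads(payload.strip().rstrip(';'))
    raise ValueError('formatted_data line not found')


# Shortened stand-in for r.text; the second row is invented for illustration.
sample_page = '''<script>
var etf_holdings            = {};
etf_holdings.formatted_data = [ [ "TESLA INC", "2,073,604", "11.18" ], [ "EXAMPLE CORP", "1,000", "1.00" ] ];
</script>'''

rows = extract_formatted_data(sample_page)
print(len(rows), rows[0][0])  # 2 TESLA INC
```

str.partition('=') splits on the first = only, so any later = characters inside the row markup (for example in href= attributes) are left alone.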
    

    Each item is the raw data used to create a row in the table:

    >>> data[0]
    ['TESLA INC',
     '<button class="modal_external appear-on-focus" 
        href="/modals/quick-quote.php" rel="TSLA">TSLA Quick Quote</button>
        <a href="//www.zacks.com/funds/etf/TSLA" rel="TSLA" 
        class=" hoverquote-container-od " show-add-portfolio="true" >
        <span class="hoverquote-symbol">TSLA<span class="sr-only"></span></span></a>',
     '2,073,604',
     '11.18',
     '8.48',
     '<a class="report_document newwin" 
       href="/zer/report/TSLA" alt="View Report" title="View Report"></a>']
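From here, the dictionary asked for in the question could be built roughly like this. It assumes that index 0 of each row is the holding name and index 3 is the Weight(%) column, which matches the row shown above but is worth checking against the table on the page. The data list below is a trimmed stand-in for the real one, and ticker_from_cell is a helper name made up for this sketch.

```python
import re

# Trimmed stand-in for `data`, shaped like the row shown above.
data = [
    ['TESLA INC',
     '<button rel="TSLA">TSLA Quick Quote</button>'
     '<a href="//www.zacks.com/funds/etf/TSLA" rel="TSLA"></a>',
     '2,073,604',
     '11.18',
     '8.48',
     ''],
]

# Assumption: row[0] is the holding name and row[3] is Weight(%).
weights = {row[0]: float(row[3]) for row in data}


def ticker_from_cell(cell_html):
    """Pull the symbol out of the first rel="..." attribute, if there is one."""
    match = re.search(r'rel="([^"]+)"', cell_html)
    return match.group(1) if match else None


# Same weights, keyed by ticker symbol instead of by name.
weights_by_ticker = {ticker_from_cell(row[1]): float(row[3]) for row in data}

print(weights)            # {'TESLA INC': 11.18}
print(weights_by_ticker)  # {'TSLA': 11.18}
```

float() raises ValueError on anything that is not a plain number, so if some rows carry a placeholder instead of a weight they would need to be skipped or handled separately.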