python beautifulsoup screen-scraping

BeautifulSoup: scraping a table by class attribute -- why don't I get any data?


I'm trying to scrape the ticker symbols located here using BeautifulSoup. Currently, I've tried the following:

import urllib
import BeautifulSoup
import re

url  = r'https://investor.vanguard.com/mutual-funds/vanguard-mutual-funds-list'
html = urllib.urlopen(url).read()
soup = BeautifulSoup.BeautifulSoup(html)

table = soup.findAll('td', attrs = {'class': re.compile(r'\bticker left\b')})

This returns nothing, however. Can someone explain why I can't get all the td tags with this class attribute? The HTML suggests it should be possible, and relatively painless. For example:

<td class="ticker left">VUSXX              </td>

Thank you.


Solution

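  • Continuing my comment above: the table appears to be filled in by JavaScript after the page loads, so it is not in the HTML that urllib downloads. You can instead use the following URL, which returns the required data (found with the Firefox extension Live HTTP Headers):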

    https://api.vanguard.com/rs/ire/02/ind/mf/month-end.jsonp?callback=callback

    --

    You could also use Selenium, which drives a real Firefox browser.

    1) Install Selenium IDE: http://docs.seleniumhq.org/download/

    2) Install the Selenium Python module: https://pypi.python.org/pypi/selenium

    Then you can use the following script, which opens a Firefox browser and gets the results.

    from selenium import webdriver
    from bs4 import BeautifulSoup  # use bs4 from now on

    # Launch a real Firefox so the page's JavaScript runs
    browser = webdriver.Firefox()

    browser.get('https://investor.vanguard.com/mutual-funds/vanguard-mutual-funds-list')

    # Grab the HTML as the browser currently sees it, then close the browser
    html = browser.page_source
    browser.quit()
    soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly

    # Every table row on the page
    mydata = soup.find_all('tr')
    

    You can then find what you want in mydata.
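
Once you have the rendered HTML, pulling out the ticker cells could look roughly like the sketch below. It runs on a small hand-written sample instead of the live page: `SAMPLE_HTML`, the `extract_tickers` helper and the second ticker `ABCDX` are made up for illustration, and only the `<td class="ticker left">` markup comes from the question. It uses a CSS selector rather than a regex on the class string, so the match doesn't depend on the order of the classes.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for browser.page_source (assumption: the real
# page has many more rows and columns). "ABCDX" is a made-up ticker.
SAMPLE_HTML = """
<table>
  <tr><td class="name left">Fund one</td><td class="ticker left">VUSXX              </td></tr>
  <tr><td class="name left">Fund two</td><td class="left ticker">ABCDX  </td></tr>
</table>
"""


def extract_tickers(html):
    """Return the stripped text of every <td> that has both classes."""
    soup = BeautifulSoup(html, "html.parser")
    # "td.ticker.left" tests each class separately, so class order and
    # any extra classes on the cell do not matter.
    return [td.get_text(strip=True) for td in soup.select("td.ticker.left")]


print(extract_tickers(SAMPLE_HTML))
```

With the Selenium script you would pass `html` (the value of `browser.page_source`) to the helper instead of the sample string.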
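
If you go the JSONP route instead, the response is JSON wrapped in a function call (`callback(...)`, per the `callback=callback` parameter in the URL), so the wrapper has to come off before `json` can parse it. Here is a minimal sketch of that step. The `unwrap_jsonp` helper and the sample payload are hypothetical: the answer doesn't show what fields the real response contains, so the `fund` and `ticker` keys below are invented purely to have something to parse.

```python
import json
import re


def unwrap_jsonp(text):
    """Strip a `name( ... )` JSONP wrapper and parse the JSON inside it."""
    # Function name, opening parenthesis, payload, closing parenthesis,
    # optional trailing semicolon.
    match = re.match(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", text, re.DOTALL)
    if match is None:
        raise ValueError("response does not look like JSONP")
    return json.loads(match.group(1))


# Made-up payload: the structure of the real response is an assumption.
sample = 'callback({"fund": [{"ticker": "VUSXX"}]});'
data = unwrap_jsonp(sample)
print(data["fund"][0]["ticker"])
```

In Python 3 the text to unwrap would come from something like `urllib.request.urlopen(url).read().decode()`. Look at the actual response first to see which keys it really uses.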