I'm trying to scrape the ticker symbols located here using BeautifulSoup. Currently, I've tried the following:
import urllib
import BeautifulSoup
import re
url = r'https://investor.vanguard.com/mutual-funds/vanguard-mutual-funds-list'
html = urllib.urlopen(url).read()
soup = BeautifulSoup.BeautifulSoup(html)
table = soup.findAll('td', attrs = {'class': re.compile(r'\bticker left\b')})
This doesn't, however, give me anything. Can someone explain why I can't get all td tags with this class attribute? The html would lead one to think this would be possible, and relatively painless. For example:
<td class="ticker left">VUSXX </td>
Thank you.
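The table on that page is most likely filled in by JavaScript after the page loads, so the HTML that urllib downloads does not contain those td tags. You can instead request the following URL, which returns the required data (I found it with the Firefox extension Live HTTP Headers):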
https://api.vanguard.com/rs/ire/02/ind/mf/month-end.jsonp?callback=callback
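Note that this endpoint returns JSONP rather than plain JSON, i.e. the payload is wrapped in a function call such as callback(...). Here is a minimal sketch of unwrapping it with the standard library, assuming the response has the usual name(payload); shape. The helper name strip_jsonp and the sample payload are made up for illustration; I have not shown the real field names in Vanguard's response, so adjust the keys to whatever you actually get back.

```python
import json
import re

def strip_jsonp(text):
    """Return the parsed JSON payload from a JSONP response like 'callback({...});'."""
    # Match: a function name, '(', the payload, ')', and an optional trailing ';'
    match = re.match(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', text, re.DOTALL)
    if match is None:
        raise ValueError('response does not look like JSONP')
    return json.loads(match.group(1))

# Hypothetical sample; the structure of the real response is an assumption.
sample = 'callback({"fund": [{"ticker": "VUSXX"}]});'
data = strip_jsonp(sample)
tickers = [fund["ticker"] for fund in data["fund"]]
```

To run it against the live URL you would pass in the downloaded text instead of sample, e.g. urllib.request.urlopen(url).read().decode('utf-8') on Python 3.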
--
You could also use Selenium, which drives a real Firefox browser.
1) Install Selenium IDE http://docs.seleniumhq.org/download/
2) Install Selenium Python module https://pypi.python.org/pypi/selenium
Then you can use the following script, which opens a Firefox browser and gets the results.
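from selenium import webdriver
from bs4 import BeautifulSoup  # use bs4 (BeautifulSoup 4) from now on
# Open a real Firefox window so the page's JavaScript runs
browser = webdriver.Firefox()
browser.get('https://investor.vanguard.com/mutual-funds/vanguard-mutual-funds-list')
# page_source is the HTML as the browser currently has it
html = browser.page_source
browser.quit()  # close the browser once the HTML is captured
# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')
mydata = soup.find_all('tr')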
You can then find what you want in mydata.
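If you only want the ticker symbols, one option is a CSS selector on the td cells rather than walking every tr. This is a sketch that assumes the rendered markup matches the <td class="ticker left"> snippet from the question; the sample HTML below is a stand-in for browser.page_source, and extract_tickers is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

def extract_tickers(html):
    """Return the stripped text of every <td> that has both classes 'ticker' and 'left'."""
    soup = BeautifulSoup(html, 'html.parser')
    # 'td.ticker.left' matches cells carrying both classes, in any order
    return [td.get_text(strip=True) for td in soup.select('td.ticker.left')]

# Stand-in for browser.page_source; only the first cell comes from the question.
sample_html = (
    '<table><tr><td class="ticker left">VUSXX </td></tr>'
    '<tr><td class="name left">Some fund</td></tr></table>'
)
tickers = extract_tickers(sample_html)
```

Selecting by class name this way avoids writing a regex against the class attribute, which bs4 treats as a list of separate class names rather than one string.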