I am using the following code (slightly modified from an early example in Nathan Yau's "Visualize This") to scrape weather data from WUnderGround's site. As you can see, Python is grabbing the numeric data from the element with class name "wx-data".
However, I'd also like to grab the average humidity from DailyHistory.html. The problem is that not all of the 'span' elements have a class name, and the average humidity cell is one of those without. How can I select this particular cell using BeautifulSoup and the code below?
(Here is an example of the page being scraped. Open your browser's developer tools and search for 'wx-data' to see the 'span' element being referenced:
http://www.wunderground.com/history/airport/LAX/2002/1/1/DailyHistory.html)
import urllib2
from BeautifulSoup import BeautifulSoup
year = 2004
#create comma-delim file
f = open(str(year) + '_LAXwunder_data.txt','w')
#iterate through month and day
for m in range(1,13):
    for d in range (1,32):
        #Chk if already gone through month
        if (m == 2 and d > 28):
            break
        elif (m in [4,6,9,11]) and d > 30:
            break
        # open wug url
        timestamp = str(year)+'0'+str(m)+'0'+str(d)
        print 'Getting data for ' + timestamp
        url = 'http://www.wunderground.com/history/airport/LAX/'+str(year) + '/' + str(m) + '/' + str(d) + '/DailyHistory.html'
        page = urllib2.urlopen(url)
        #Get temp from page
        soup = BeautifulSoup(page)
        #dayTemp = soup.body.wx-data.b.string
        dayTemp = soup.findAll(attrs = {'class':'wx-data'})[5].span.string
        #Format month for timestamp
        if len(str(m)) < 2:
            mStamp = '0' + str(m)
        else:
            mStamp = str(m)
        #Format day for timestamp
        if len(str(d)) < 2:
            dStamp = '0' + str(d)
        else:
            dStamp = str(d)
        #Build timestamp
        timestamp = str(year)+ mStamp + dStamp
        #Write timestamp and temp to file
        f.write(timestamp + ',' + dayTemp +'\n')
#done - close
f.close()
You can search for the cell containing the text, then move up and over to the next cell:
humidity = soup.find(text='Average Humidity')
next_cell = humidity.find_parent('td').find_next_sibling('td')
humidity_value = next_cell.string
I'm using BeautifulSoup version 4 here, not 3; you really want to upgrade, as version 3 was mothballed 2 years ago.
BeautifulSoup 3 can do this specific trick too; use findParent() and findNextSibling() instead there though.
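If you want to try the label-then-sibling lookup without hitting the site, here is a minimal offline sketch. The HTML snippet is made up for illustration and is only assumed to resemble the real table; value_next_to is a hypothetical helper name, not part of BeautifulSoup. It uses string= (the newer spelling of text= in BeautifulSoup 4.4 and later) and returns None rather than raising when the label or its neighbouring cell is missing.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; the real page layout may differ.
SAMPLE_HTML = (
    '<table>'
    '<tr><td><span>Average Humidity</span></td><td>88</td></tr>'
    '<tr><td><span>Maximum Humidity</span></td>'
    '<td><span class="wx-data">100</span></td></tr>'
    '</table>'
)


def value_next_to(soup, label):
    """Return the text of the <td> that follows the <td> holding `label`.

    Hypothetical helper: returns None if the label, its cell, or the
    following cell cannot be found.
    """
    # Find the text node that is exactly equal to the label.
    label_node = soup.find(string=label)
    if label_node is None:
        return None
    # Move up to the enclosing table cell.
    label_cell = label_node.find_parent('td')
    if label_cell is None:
        return None
    # Move over to the next cell in the same row.
    value_cell = label_cell.find_next_sibling('td')
    if value_cell is None:
        return None
    return value_cell.get_text(strip=True)


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
print(value_next_to(soup, 'Average Humidity'))
print(value_next_to(soup, 'Dew Point'))
```

Using get_text(strip=True) instead of .string means the helper still works when the value cell wraps its number in a nested span, as the second sample row does.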
Demo:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> response = requests.get('http://www.wunderground.com/history/airport/LAX/2002/1/1/DailyHistory.html')
>>> soup = BeautifulSoup(response.content)
>>> humidity = soup.find(text='Average Humidity')
>>> next_cell = humidity.find_parent('td').find_next_sibling('td')
>>> next_cell.string
u'88'
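To fold the humidity into the loop from the question, the date handling and the output line could be sketched roughly as below. This is not the original script: it assumes Python 3, and iter_days and URL_TEMPLATE are hypothetical names introduced here. It relies on calendar.monthrange for the month lengths, which also covers 29 February (2004 is a leap year, so the d > 28 check in the question skips that day), and on csv.writer for the comma-delimited row. The 'TEMP' and 'HUMIDITY' strings are placeholders, not scraped values.

```python
import calendar
import csv
import io

# Same URL pattern as in the question, with named placeholders.
URL_TEMPLATE = ('http://www.wunderground.com/history/airport/LAX/'
                '{year}/{month}/{day}/DailyHistory.html')


def iter_days(year):
    """Yield (timestamp, url) for every calendar day of `year`."""
    for month in range(1, 13):
        # monthrange returns (weekday of the 1st, number of days in month).
        days_in_month = calendar.monthrange(year, month)[1]
        for day in range(1, days_in_month + 1):
            # Zero-padded YYYYMMDD timestamp, e.g. 20040101.
            timestamp = '{:04d}{:02d}{:02d}'.format(year, month, day)
            url = URL_TEMPLATE.format(year=year, month=month, day=day)
            yield timestamp, url


days = list(iter_days(2004))
print(len(days))
print(days[0])

# Write one row to an in-memory buffer; a real run would open a file
# and pass in the scraped temperature and humidity instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow([days[0][0], 'TEMP', 'HUMIDITY'])
print(buffer.getvalue())
```

Inside a real loop, each url would be fetched and parsed, and the row would hold dayTemp and the humidity value found next to the 'Average Humidity' label.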