How to use beautifulsoup when HTML element doesn't have a class name?


I am using the following code (slightly modified from an early example in Nathan Yau's "Visualize This") to scrape weather data from Weather Underground's site. As you can see, Python grabs the numeric value from the sixth element with class name "wx-data".

However, I'd also like to grab the average humidity from DailyHistory.html. The problem is that not all of the 'span' elements have a class name, and the average humidity cell is one of those without. How can I select this particular cell using BeautifulSoup and the code below?

(Here is an example of the page being scraped; open your browser's developer tools and search for 'wx-data' to see the 'span' element being referenced:

http://www.wunderground.com/history/airport/LAX/2002/1/1/DailyHistory.html)

import urllib2
from BeautifulSoup import BeautifulSoup

year = 2004    


#create comma-delim file

f = open(str(year) + '_LAXwunder_data.txt','w')

#iterate through month and day
for m in range(1,13):
    for d in range (1,32):

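        #Check if already gone through month (February has 29 days in leap years)
        leap = year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
        if m == 2 and d > (29 if leap else 28):
            break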
        elif (m in [4,6,9,11]) and d > 30:
            break

        # open wug url
        timestamp = '%d%02d%02d' % (year, m, d)  # zero-pad month and day only when needed
        print 'Getting data for ' + timestamp
        url = 'http://www.wunderground.com/history/airport/LAX/'+str(year) + '/' + str(m) + '/' + str(d) + '/DailyHistory.html'
        page = urllib2.urlopen(url)

        #Get temp from page
        soup = BeautifulSoup(page)
        #dayTemp = soup.body.wx-data.b.string
        dayTemp = soup.findAll(attrs = {'class':'wx-data'})[5].span.string

        #Format month for timestamp
        if len(str(m)) < 2:
            mStamp = '0' + str(m)
        else:
            mStamp = str(m)
        #Format day for timestamp
        if len(str(d)) < 2:
            dStamp = '0' + str(d)
        else:
            dStamp = str(d)

        #Build timestamp
        timestamp = str(year)+ mStamp + dStamp

        #Write timestamp and temp to file
        f.write(timestamp + ',' + dayTemp +'\n')

#done - close
f.close()

Solution

  • You can search for the cell containing the text, then move up and over to the next cell:

    humidity = soup.find(text='Average Humidity')
    next_cell = humidity.find_parent('td').find_next_sibling('td')
    humidity_value = next_cell.string
    

    I'm using BeautifulSoup version 4 here, not 3; you really want to upgrade, as version 3 was mothballed two years ago.

    BeautifulSoup 3 can do this specific trick too, but there you use findParent() and findNextSibling() instead.

    Demo:

    >>> import requests
    >>> from bs4 import BeautifulSoup
    >>> response = requests.get('http://www.wunderground.com/history/airport/LAX/2002/1/1/DailyHistory.html')
    >>> soup = BeautifulSoup(response.content)
    >>> humidity = soup.find(text='Average Humidity')
    >>> next_cell = humidity.find_parent('td').find_next_sibling('td')
    >>> next_cell.string
    u'88'
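
The same lookup can be wrapped in a small helper so that a missing row does not raise an AttributeError in the middle of a long scraping loop. This is only a sketch: the helper name value_after_label and the SAMPLE_HTML snippet are made up for illustration, and the real page's markup may differ. It uses the string= argument, which newer BeautifulSoup 4 releases accept in place of text=.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the history table;
# the markup of the real page may differ.
SAMPLE_HTML = """
<table>
  <tr>
    <td><span>Mean Temperature</span></td>
    <td><span class="wx-data"><span class="wx-value">57</span></span></td>
  </tr>
  <tr>
    <td><span>Average Humidity</span></td>
    <td>88</td>
  </tr>
</table>
"""


def value_after_label(soup, label):
    """Return the text of the <td> that follows the cell containing `label`.

    Returns None if the label, its enclosing cell, or the next cell is missing.
    """
    # Find the text node that exactly matches the label.
    label_node = soup.find(string=label)
    if label_node is None:
        return None
    # Move up to the enclosing table cell.
    label_cell = label_node.find_parent('td')
    if label_cell is None:
        return None
    # Move over to the next cell in the same row.
    value_cell = label_cell.find_next_sibling('td')
    if value_cell is None:
        return None
    return value_cell.get_text(strip=True)


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
humidity = value_after_label(soup, 'Average Humidity')
missing = value_after_label(soup, 'No Such Row')
```

In the question's loop, something like value_after_label(soup, 'Average Humidity') could be called right after dayTemp is read, and the row skipped or written with an empty field when it returns None.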