
Web Scraping with Wunderground data, BeautifulSoup


Okay, I'm at my wit's end here. For my class, we are supposed to scrape data from the wunderground.com website. We keep running into one of two problems: either the code raises error messages, or it runs without errors but the .txt file contains no data. It's pretty annoying, because I need to get this done! Here is my code:

f = open('wunder-data1.txt', 'w')
for m in range(1, 13):
for d in range(1, 32):
    if (m == 2 and d > 28):
        break
    elif (m in [4, 6, 9, 11] and d > 30):
        break
    url = "http://www.wunderground.com/history/airport/KBUF/2009/" + str(m) + "/" + str(d) + "/DailyHistory.html"
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page, "html.parser")
    dayTemp = soup.find("span", text="Mean Temperature").parent.find_next_sibling("td").get_text(strip=True)
    if len(str(m)) < 2:
        mStamp = '0' + str(m)
    else:
        mStamp = str(m)
    if len(str(d)) < 2:
        dStamp = '0' +str(d)
    else:
        dStamp = str(d)
    timestamp = '2009' + mStamp +dStamp
    f.write(timestamp.encode('utf-8') + ',' + dayTemp + '\n')
    f.close()

Also, sorry: the indentation shown here is probably not correct Python. I'm not very good at this.

UPDATE: Someone answered the question below, and it worked, but I realized I was pulling the wrong data (oops). So I changed the code to this:

    import codecs
    import urllib2
    from bs4 import BeautifulSoup

    f = codecs.open('wunder-data2.txt', 'w', 'utf-8')

    for m in range(1, 13):
        for d in range(1, 32):
            if (m == 2 and d > 28):
                break
            elif (m in [4, 6, 9, 11] and d > 30):
                break

            url = "http://www.wunderground.com/history/airport/KBUF/2009/" + str(m) + "/" + str(d) + "/DailyHistory.html"
            page = urllib2.urlopen(url)
            soup = BeautifulSoup(page, "html.parser")

            dayTemp = soup.findAll(attrs={"class":"wx-value"})[5].span.string
            if len(str(m)) < 2:
                mStamp = '0' + str(m)
            else:
                mStamp = str(m)
            if len(str(d)) < 2:
                dStamp = '0' +str(d)
            else:
                dStamp = str(d)

            timestamp = '2009' + mStamp +dStamp

            f.write(timestamp.encode('utf-8') + ',' + dayTemp + '\n')

    f.close()

So I'm pretty unsure. What I'm trying to do is data scrape the


Solution

  • I encountered the following errors (and fixed them below) when trying to execute your code:

    1. Indentation of the nested loops was invalid.
    2. Missing imports (the lines at the top), but maybe you just excluded them from your paste.
    3. Trying to write "utf-8" encoded strings to an "ascii" file. To fix this I used the codecs module to open the file f as "utf-8".
    4. The file was closed inside the loop, so it was already closed after the first write and the next write failed. I moved the f.close() line outside of the loops.
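
    To illustrate point 4 in isolation, here is a minimal sketch (not the original script). It assumes an in-memory io.StringIO buffer as a stand-in for the output file, and the names failed_days and day are made up for the demo. Closing inside the loop lets the first write succeed and makes every later write raise ValueError:

```python
import io

# Stand-in for the output file (assumption: an in-memory buffer behaves
# like a real file here, since both refuse writes once closed).
f = io.StringIO()
failed_days = []  # hypothetical name: days whose write failed

for day in range(1, 4):
    try:
        f.write(u"200901%02d\n" % day)
    except ValueError:
        # Raised because the file was already closed on an earlier pass.
        failed_days.append(day)
    f.close()  # the bug: this belongs after the loop, not inside it

print(failed_days)
```

    Only the first iteration writes anything; moving f.close() below the loop removes the failures.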

    Now as far as I can tell (without you telling us what you actually want this code to do), it's working? At least no errors are immediately popping up...

    import codecs
    import urllib2
    from bs4 import BeautifulSoup
    
    f = codecs.open('wunder-data1.txt', 'w', 'utf-8')
    
    for m in range(1, 13):
        for d in range(1, 32):
            if (m == 2 and d > 28):
                break
            elif (m in [4, 6, 9, 11] and d > 30):
                break
    
            url = "http://www.wunderground.com/history/airport/KBUF/2009/" + str(m) + "/" + str(d) + "/DailyHistory.html"
            page = urllib2.urlopen(url)
            soup = BeautifulSoup(page, "html.parser")
    
            dayTemp = soup.find("span", text="Mean Temperature").parent.find_next_sibling("td").get_text(strip=True)
    
            if len(str(m)) < 2:
                mStamp = '0' + str(m)
            else:
                mStamp = str(m)
            if len(str(d)) < 2:
                dStamp = '0' +str(d)
            else:
                dStamp = str(d)
    
            timestamp = '2009' + mStamp +dStamp
    
            f.write(timestamp.encode('utf-8') + ',' + dayTemp + '\n')
    
    f.close()
    

    As the comments on your question have suggested, there are other areas for improvement here which I have not touched on - I've simply tried to get the code you posted executing.
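
    One possible direction for those improvements, offered as a rough sketch and not as the definitive way to do it: let datetime produce the valid days and the zero-padded stamp, so the manual month-length checks and the '0' + str(m) padding go away. The helper names (days_in_year, history_url, timestamp, write_year) and the fetch_temp callback are hypothetical. The URL shape is copied from the question, and the real scraping step is left out, so the usage lines pass in a stand-in fetcher that touches no network.

```python
import io
from datetime import date, timedelta

# URL prefix taken from the question's code.
BASE_URL = "http://www.wunderground.com/history/airport/KBUF"


def days_in_year(year):
    """Yield every calendar date in the given year, in order."""
    current = date(year, 1, 1)
    while current.year == year:
        yield current
        current += timedelta(days=1)


def history_url(day):
    """Build the DailyHistory URL in the same shape the question uses."""
    return "%s/%d/%d/%d/DailyHistory.html" % (
        BASE_URL, day.year, day.month, day.day)


def timestamp(day):
    """Return a zero-padded YYYYMMDD stamp for the date."""
    return day.strftime("%Y%m%d")


def write_year(out, year, fetch_temp):
    """Write one 'YYYYMMDD,value' line per day of the year.

    fetch_temp is a caller-supplied function (hypothetical) that takes a
    URL and returns the scraped value as text.
    """
    for day in days_in_year(year):
        out.write(u"%s,%s\n" % (timestamp(day), fetch_temp(history_url(day))))


# Usage with a stand-in fetcher; a real one would wrap the urlopen and
# BeautifulSoup lookup from the code above.
buf = io.StringIO()
write_year(buf, 2009, lambda url: u"n/a")
lines = buf.getvalue().splitlines()
print(lines[0], lines[-1], len(lines))
```

    With a real file, opening it in a with block (for example with codecs.open(...) as f:) would close it exactly once after the loop finishes, which also avoids the close-inside-the-loop mistake.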