Tags: python, beautifulsoup, python-2.6

Parsing Environment Canada Website


I am trying to scrape the weather forecast from "https://weather.gc.ca/city/pages/ab-52_metric_e.html". With the code below I can get the table containing the data, but then I'm stuck. During the day, the second row contains Today's forecast and the third row contains Tonight's forecast. At the end of the day, the second row becomes Tonight's forecast and Today's forecast is dropped. I want to walk through the table and get the forecast for Today, Tonight, and each following day, even when Today's forecast is missing. The result should look something like this:

Today: A mix of sun and cloud. 60 percent chance of showers this afternoon with risk of a thunderstorm. Widespread smoke. High 26. UV index 6 or high.
Tonight: Partly cloudy. Becoming clear this evening. Increasing cloudiness before morning. Widespread smoke. Low 13.
Friday: Mainly cloudy. Widespread smoke. Wind becoming southwest 30 km/h gusting to 50 in the afternoon. High 24.

# using Beautiful Soup 3, Python 2.6
from BeautifulSoup import BeautifulSoup
import urllib

# download the forecast page
pageFile = urllib.urlopen("https://weather.gc.ca/city/pages/ab-52_metric_e.html")
pageHtml = pageFile.read()
pageFile.close()

# parse the page and narrow down to the main content area
soup = BeautifulSoup(pageHtml)
data = soup.find("div", {"id": "mainContent"})

# the text forecast lives in this table
forecast = data.find('table', {'class': "table mrgn-bttm-md mrgn-tp-md textforecast hidden-xs"})

Solution

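  • You could iterate over the rows of the table and print the text of each one. For example:

    # findAll is the Beautiful Soup 3 name; Beautiful Soup 4 calls it find_all
    forecast = data.find('table', {'class': "table mrgn-bttm-md mrgn-tp-md textforecast hidden-xs"}).findAll("tr")
    for tr in forecast[1:]:  # skip the first row, which is a header
        # collapse runs of whitespace in the row's text
        print " ".join(tr.text.split())

    With this approach you get the contents of each row, excluding the first one, which is a header. Because every row is printed in order, it does not matter whether the Today row is present or has already been dropped.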
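To get the "Today: ...", "Tonight: ..." layout from the question regardless of which rows are present, one option is to take the label from each row itself instead of relying on its position in the table. The following is only a sketch under an assumption I have not checked against the live page: that every forecast row has exactly two cells (the period name and the forecast text) and that the header row does not. The function name label_forecast and the sample data are made up for illustration. With Beautiful Soup 3 the cell texts for a row could be collected with something like [td.text for td in tr.findAll("td")].

```python
def normalize(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def label_forecast(rows):
    """Turn rows of cell texts into 'Period: forecast' strings.

    rows is a list of lists, one inner list of cell strings per table row.
    Rows that do not have exactly two non-empty cells (for example a
    header row) are skipped. The two-cell layout is an assumption.
    """
    lines = []
    for cells in rows:
        cells = [normalize(c) for c in cells]
        cells = [c for c in cells if c]
        if len(cells) != 2:
            continue
        period, text = cells
        lines.append("{0}: {1}".format(period, text))
    return lines


# Hypothetical evening table: the "Today" row has already been dropped.
evening_rows = [
    ["Forecast"],
    ["Tonight", "Partly cloudy.\n  Low 13."],
    ["Friday", "Mainly cloudy. High 24."],
]

for line in label_forecast(evening_rows):
    print(line)
# Tonight: Partly cloudy. Low 13.
# Friday: Mainly cloudy. High 24.
```

Since the label comes from the row's first cell, the same loop works during the day, when a "Today" row is present, and in the evening, when the table starts at "Tonight".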