
Python Scrapy function that doesn't always work


The script below works about 90% of the time to collect weather data. However, there are a few cases where it just fails for some reason, even though the HTML is consistent with the requests that succeed. There are times when the code and the request are identical, yet it still fails.

import scrapy


class NflweatherdataSpider(scrapy.Spider):
    name = 'NFLWeatherData'
    allowed_domains = ['nflweather.com']
    # start_urls = ['http://nflweather.com/']

    def __init__(self, Week='', Year='', Game='', **kwargs):
        self.start_urls = [f'https://nflweather.com/{Week}/{Year}/{Game}']
        self.Year = Year
        self.Game = Game
        super().__init__(**kwargs)
        print(self.start_urls)

    def parse(self, response):
        self.log(self.start_urls)
        # self.log(self.Year)

        # Extract the content using CSS selectors
        game_boxes = response.css('div.game-box')

        for game_box in game_boxes:
            # Extract date and time information
            Datetimes = game_box.css('.col-12 .fw-bold::text').get()

            # Extract team information
            team_game_boxes = game_box.css('.team-game-box')
            awayTeams = team_game_boxes.css('.fw-bold::text').get()
            homeTeams = team_game_boxes.css('.fw-bold.ms-1::text').get()

            # Extract temperature and probability information
            TempProbs = game_box.css('.col-md-4 .mx-2 span::text').get()

            # Extract wind speed and direction information
            windspeeds = game_box.css('.icon-weather + span::text').get()
            winddirection = game_box.css('.md-18 ::text').get()

            # Create a dictionary to store the scraped info
            scraped_info = {
                'Year': self.Year,
                'Game': self.Game,
                'Datetime': Datetimes.strip(),
                'awayTeam': awayTeams,
                'homeTeam': homeTeams,
                'TempProb': TempProbs,
                'windspeeds': windspeeds.strip(),
                'winddirection': winddirection.strip()
            }

            # Yield the scraped info to Scrapy
            yield scraped_info

These are the Scrapy commands used to run the crawler:

scrapy crawl NFLWeatherData -a Week=week -a Year=2012 -a Game=week-6 -o NFLWeather_2012_week_6.json   
scrapy crawl NFLWeatherData -a Week=week -a Year=2012 -a Game=week-7 -o NFLWeather_2012_week_7.json
scrapy crawl NFLWeatherData -a Week=week -a Year=2012 -a Game=week-8 -o NFLWeather_2012_week_8.json

The week 6 crawl works perfectly with no issues.

The week 7 crawl returns nothing:

ERROR: Spider error processing <GET https://nflweather.com/week/2012/week-7> (referer: None)
Traceback (most recent call last):
  File "G:\ProgramFiles\MiniConda3\envs\WrkEnv\lib\site-packages\scrapy\utils\defer.py", line 279, in iter_errback
    yield next(it)

The week 8 crawl retrieves 2 lines and errors out for the rest:

ERROR: Spider error processing <GET https://nflweather.com/week/2012/week-8> (referer: None)
Traceback (most recent call last):
  File "G:\ProgramFiles\MiniConda3\envs\WrkEnv\lib\site-packages\scrapy\utils\defer.py", line 279, in iter_errback
    yield next(it)

Any idea why these requests fail while the others have no issues?


Solution

  • The error lies with the windspeeds variable. Sometimes the weather data is missing, so windspeeds will be None; when you then build the dictionary and invoke windspeeds.strip(), it throws the exception.

    You could solve this with a simple None check when creating the dictionary, or do the check earlier, whichever fits your needs best. Here is a working example:

    scraped_info = {
       'Year': self.Year,
       'Game': self.Game,
       'Datetime': Datetimes.strip(),
       'awayTeam': awayTeams,
       'homeTeam': homeTeams,
       'TempProb': TempProbs,
       'windspeeds': windspeeds.strip() if windspeeds is not None else "TBD",
       'winddirection': winddirection.strip() if winddirection is not None else "TBD"
    }
    

    You will also notice that the "working" week-6 example you provided will now contain more results than before.
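
    If you would rather handle the missing values before building the dictionary, Scrapy's get() also accepts a default= argument that is returned when the selector matches nothing, so the later .strip() calls never see None. Here is a minimal sketch of the extraction step with that approach (the "TBD" fallback is just the placeholder value used above, not something the site returns):

    # Fall back to a placeholder when the weather cells are absent,
    # so .strip() is always called on a string.
    windspeeds = game_box.css('.icon-weather + span::text').get(default='TBD')
    winddirection = game_box.css('.md-18 ::text').get(default='TBD')

    scraped_info = {
        'Year': self.Year,
        'Game': self.Game,
        'Datetime': Datetimes.strip(),
        'awayTeam': awayTeams,
        'homeTeam': homeTeams,
        'TempProb': TempProbs,
        'windspeeds': windspeeds.strip(),
        'winddirection': winddirection.strip()
    }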