The script below works 90% of the time to collect weather data. However, there are a few cases where it just fails for no apparent reason, even though the HTML is consistent with the requests that succeed. There are even times where the code and the request are identical and it still fails.
import scrapy

class NflweatherdataSpider(scrapy.Spider):
    name = 'NFLWeatherData'
    allowed_domains = ['nflweather.com']
    # start_urls = ['http://nflweather.com/']

    def __init__(self, Week='', Year='', Game='', **kwargs):
        self.start_urls = [f'https://nflweather.com/{Week}/{Year}/{Game}']
        self.Year = Year
        self.Game = Game
        super().__init__(**kwargs)
        print(self.start_urls)

    def parse(self, response):
        self.log(self.start_urls)
        # Extract the content using CSS selectors
        game_boxes = response.css('div.game-box')
        for game_box in game_boxes:
            # Extract date and time information
            Datetimes = game_box.css('.col-12 .fw-bold::text').get()
            # Extract team information
            team_game_boxes = game_box.css('.team-game-box')
            awayTeams = team_game_boxes.css('.fw-bold::text').get()
            homeTeams = team_game_boxes.css('.fw-bold.ms-1::text').get()
            # Extract temperature and probability information
            TempProbs = game_box.css('.col-md-4 .mx-2 span::text').get()
            # Extract wind speed and direction information
            windspeeds = game_box.css('.icon-weather + span::text').get()
            winddirection = game_box.css('.md-18 ::text').get()
            # Create a dictionary to store the scraped info
            scraped_info = {
                'Year': self.Year,
                'Game': self.Game,
                'Datetime': Datetimes.strip(),
                'awayTeam': awayTeams,
                'homeTeam': homeTeams,
                'TempProb': TempProbs,
                'windspeeds': windspeeds.strip(),
                'winddirection': winddirection.strip()
            }
            # Yield the scraped info to Scrapy
            yield scraped_info
These are the scrapy commands used to run the crawler:
scrapy crawl NFLWeatherData -a Week=week -a Year=2012 -a Game=week-6 -o NFLWeather_2012_week_6.json
scrapy crawl NFLWeatherData -a Week=week -a Year=2012 -a Game=week-7 -o NFLWeather_2012_week_7.json
scrapy crawl NFLWeatherData -a Week=week -a Year=2012 -a Game=week-8 -o NFLWeather_2012_week_8.json
The week 6 crawl works perfectly, no issues.
The week 7 crawl returns nothing:
ERROR: Spider error processing <GET https://nflweather.com/week/2012/week-7> (referer: None)
Traceback (most recent call last):
File "G:\ProgramFiles\MiniConda3\envs\WrkEnv\lib\site-packages\scrapy\utils\defer.py", line 279, in iter_errback
yield next(it)
The week 8 crawl retrieves 2 lines and errors out for the rest:
ERROR: Spider error processing <GET https://nflweather.com/week/2012/week-8> (referer: None)
Traceback (most recent call last):
File "G:\ProgramFiles\MiniConda3\envs\WrkEnv\lib\site-packages\scrapy\utils\defer.py", line 279, in iter_errback
yield next(it)
Any idea why these crawls fail while the others have no issues?
The error lies with the windspeeds variable: sometimes the weather data is missing, so windspeeds will be None. When you then attempt to create the dictionary object, you invoke windspeeds.strip(), which throws the exception.
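You can reproduce the failure in isolation. This is just a minimal sketch of what happens when a selector matches nothing, since .get() returns None for a missing element:

    # Minimal reproduction of the failure mode:
    windspeeds = None      # what .get() returns when the element is absent
    windspeeds.strip()     # AttributeError: 'NoneType' object has no attribute 'strip'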
You could solve this by doing a simple None check when creating the dictionary, or you're free to do the check earlier, however fits your needs best. Here is a working example:
scraped_info = {
    'Year': self.Year,
    'Game': self.Game,
    'Datetime': Datetimes.strip(),
    'awayTeam': awayTeams,
    'homeTeam': homeTeams,
    'TempProb': TempProbs,
    'windspeeds': windspeeds.strip() if windspeeds is not None else "TBD",
    'winddirection': winddirection.strip() if winddirection is not None else "TBD"
}
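If you'd rather do the check earlier, one option is to normalize every .get() result through a small helper, so Datetimes and the other fields are guarded the same way. This is only a sketch; clean_text is a hypothetical helper name, not something provided by Scrapy:

    def clean_text(value, default='TBD'):
        """Strip whitespace from a selector result, falling back when it is None."""
        return value.strip() if value is not None else default

    # ...then inside parse():
    scraped_info = {
        'Year': self.Year,
        'Game': self.Game,
        'Datetime': clean_text(Datetimes),
        'awayTeam': clean_text(awayTeams),
        'homeTeam': clean_text(homeTeams),
        'TempProb': clean_text(TempProbs),
        'windspeeds': clean_text(windspeeds),
        'winddirection': clean_text(winddirection),
    }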
You will also notice that the "working" week-6 example you provided will now contain more results than before, since the exception no longer aborts the parse loop partway through the page.