Scrapy newbie here. I am trying to scrape data from https://www.aims.gov.au, and more specifically from https://weather.aims.gov.au/#/station/4. When I scrape the station/4 page I get nothing back, whereas on the aims.gov.au homepage I can retrieve basically anything. Any idea why this may be? Here is my code; hopefully someone can help me see where I am going wrong.
The first code segment scrapes a random heading just to show that I can scrape from the website, but then when I move to my desired page (second code segment) I can't scrape a single thing...
All settings are default for the other generated scrapy files.
Code for home page (testing to see if I can scrape here):
class GBRspider(scrapy.Spider):
    name = 'GBRspider'
    allowed_domains = ['weather.aims.gov.au']
    start_urls = ['https://www.aims.gov.au']

    def parse(self, response):
        data = response
        yield {
            'temp': data.css('h1.banner-title::text').get()
        }
This gives me temp : "Australia's tropical marine research agency"
Code for desired page:
class GBRspider(scrapy.Spider):
    name = 'GBRspider'
    allowed_domains = ['weather.aims.gov.au']
    start_urls = ['https://weather.aims.gov.au/#/station/4']

    def parse(self, response):
        data = response
        yield {
            'temp': data.css('h1.ng-binding::text').get()
        }
This gives me temp : None, where it should be giving me Davies Reef
Thank you
This is because the information used to render the home page is all contained in the initial response to your HTTP request to the home page URL.
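The other URL, https://weather.aims.gov.au/#/station/4, works differently. Everything after the # is a URL fragment, which is never sent to the server, so the response Scrapy receives does not contain the station data. The page gets the information it needs from an API request to https://api.aims.gov.au/weather/station/4, which returns a JSON response that the page's JavaScript then uses to render the content in the browser. Scrapy does not run JavaScript, so the h1.ng-binding text is never in the response you parse. To get the information you want, all you have to do is send your request to the API URL instead.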
For example:
import scrapy


class GBRspider(scrapy.Spider):
    name = 'GBRspider'
    allowed_domains = ['aims.gov.au']
    start_urls = ['https://api.aims.gov.au/weather/station/4']

    def parse(self, response):
        # The API returns JSON, so decode it instead of using CSS selectors.
        data = response.json()
        yield {
            'temp': data["site_name"]
        }
OUTPUT
2023-08-26 12:55:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://api.aims.gov.au/weather/station/4> (referer: None)
2023-08-26 12:55:59 [scrapy.core.scraper] DEBUG: Scraped from <200 https://api.aims.gov.au/weather/station/4>
{'temp': 'Davies Reef'}
2023-08-26 12:55:59 [scrapy.core.engine] INFO: Closing spider (finished)
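If it helps to see why the original URL came back empty, here is a minimal sketch using only the standard library. It shows that the part after the # is split off before any request is made, and how the station page URL could be mapped to the API URL. The helper name api_url_for is made up for illustration, and it assumes other stations follow the same URL pattern as station 4, which I have not checked.

```python
from urllib.parse import urldefrag

# Assumption: every station uses the same URL pattern as station 4.
API_BASE = "https://api.aims.gov.au/weather"


def api_url_for(page_url: str) -> str:
    """Hypothetical helper: map a '#/station/<id>' page URL to its API URL."""
    # urldefrag splits off everything after '#'; the fragment is never
    # part of the HTTP request itself.
    _, fragment = urldefrag(page_url)
    # The fragment looks like '/station/4', so reuse it as the API path.
    return API_BASE + fragment


page = "https://weather.aims.gov.au/#/station/4"
requested, fragment = urldefrag(page)

print(requested)          # https://weather.aims.gov.au/
print(fragment)           # /station/4
print(api_url_for(page))  # https://api.aims.gov.au/weather/station/4
```

The first printed URL is all that is actually requested for the station page, which is why the selector finds nothing there.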