Tags: python, web-scraping, scrapy

Scrapy working on home page but not any other page


Scrapy newbie here. I am attempting to scrape data from https://www.aims.gov.au, and more specifically from https://weather.aims.gov.au/#/station/4. However, when I try to scrape the station/4 page I don't get anything back, whereas when I scrape their home page at aims.gov.au I can retrieve basically anything. Any idea why this might be? Here is my code; hopefully someone can spot where I am going wrong.

The first code segment scrapes a random heading just to show that I can scrape the website at all, but when I move to my desired page (second code segment) I can't scrape a single thing...

All settings are default for the other generated scrapy files.

Code for home page (testing to see if I can scrape here):

import scrapy

class GBRspider(scrapy.Spider):
    name = 'GBRspider'
    allowed_domains = ['weather.aims.gov.au']
    start_urls = ['https://www.aims.gov.au']

    def parse(self, response):
        data = response
        yield {
            'temp': data.css('h1.banner-title::text').get()
        }

This gives me temp : "Australia's tropical marine research agency"

Code for desired page:

import scrapy

class GBRspider(scrapy.Spider):
    name = 'GBRspider'
    allowed_domains = ['weather.aims.gov.au']
    start_urls = ['https://weather.aims.gov.au/#/station/4']

    def parse(self, response):
        data = response
        yield {
            'temp': data.css('h1.ng-binding::text').get()
        }

This gives me temp : None, whereas it should give me Davies Reef

Thank you


Solution

  • This is because all the information used to render the home page is already contained in the initial HTML response to your HTTP request to the home page URL.

    The other URL, https://weather.aims.gov.au/#/station/4, works differently: everything after the # is a client-side route that is never even sent to the server, so the initial HTML is just an empty application shell. The page's JavaScript then fetches the station data with an API request to https://api.aims.gov.au/weather/station/4, which returns a JSON response that is used to render the page in the browser. So in order to get the information you seek, all you have to do is send your request to the API URL instead.

    For example:

    import scrapy

    class GBRspider(scrapy.Spider):
        name = 'GBRspider'
        allowed_domains = ['aims.gov.au']
        start_urls = ['https://api.aims.gov.au/weather/station/4']

        def parse(self, response):
            data = response.json()
            yield {
                'temp': data["site_name"]
            }
    

    OUTPUT

    2023-08-26 12:55:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://api.aims.gov.au/weather/station/4> (referer: None)
    2023-08-26 12:55:59 [scrapy.core.scraper] DEBUG: Scraped from <200 https://api.aims.gov.au/weather/station/4>
    {'temp': 'Davies Reef'}
    2023-08-26 12:55:59 [scrapy.core.engine] INFO: Closing spider (finished)
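
    If you want to see what other fields that JSON exposes before wiring them into the spider, a quick standalone check can help. The sketch below is just that, a sketch: it assumes the endpoint is publicly reachable with a plain GET and uses the requests library outside of Scrapy. site_name is the only field confirmed above; any other keys it prints are simply whatever the API happens to return.

    import requests

    # Fetch the same station endpoint the spider uses and inspect the payload.
    resp = requests.get("https://api.aims.gov.au/weather/station/4", timeout=30)
    resp.raise_for_status()
    payload = resp.json()

    print(sorted(payload.keys()))   # list the available top-level fields
    print(payload["site_name"])     # the field used above -> 'Davies Reef'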