python scrapy scrapy-splash

Parsing output from scrapy splash



I'm testing a Splash instance with Scrapy 1.6, following https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash and https://aaqai.me/notes/scrapy-splash-setup. My spider:

import scrapy
from scrapy_splash import SplashRequest
from scrapy.utils.response import open_in_browser

class MySpider(scrapy.Spider):
    start_urls = ["http://yahoo.com"]
    name = 'mytest'

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, endpoint='render.html', args={'wait': 7.5},)

    def parse(self, response):
        # response.body is a result of render.html call; it
        # contains HTML processed by a browser.
        open_in_browser(response)
        return None

The output opens in Notepad rather than in a browser. How can I open it in a browser?


Solution

  • If you are using the Splash middleware, the Splash response goes into the regular response object, which you can access via response.css and response.xpath. Depending on which endpoint you use, you can also execute JavaScript and perform other actions on the page.

    If you need to navigate around a page or interact with it, you will need to write a Lua script and run it through the appropriate endpoint (such as execute). As for parsing the output, it automatically goes into the response object.
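
As a rough illustration of the Lua approach, here is a minimal sketch of a script for the execute endpoint and the arguments that would go with it. The helper name build_execute_args is hypothetical, and the SplashRequest call is shown only as a comment because it assumes scrapy-splash is installed and configured as in the question.

```python
# Minimal Lua script for Splash's "execute" endpoint:
# load the page, wait for JavaScript to run, return the rendered HTML.
LUA_SOURCE = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(args.wait))
    return splash:html()
end
"""


def build_execute_args(wait=0.5):
    """Hypothetical helper: arguments to send along with an 'execute' request."""
    return {"lua_source": LUA_SOURCE, "wait": wait}


args = build_execute_args(wait=7.5)

# Inside the spider's start_requests, assuming scrapy-splash is set up,
# the request would look something like:
#   yield SplashRequest(url, self.parse, endpoint="execute", args=args)
print(sorted(args))
```

The wait value mirrors the 7.5 seconds used in the question; a real script would add clicks, scrolling, or other steps between splash:go and splash:html.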

    Get rid of open_in_browser. I'm not exactly sure what you are trying to do, but if all you want is to parse the page, you can do it like this:

    body = response.css('body').extract_first()
    links = response.css('a::attr(href)').extract()
    

    Please clarify your question; most people don't want to follow links and guess what you're having trouble with.

    Update for clarified question:

    It sounds like you may want to use scrapy shell with Splash, which lets you experiment with selectors:

    scrapy shell 'http://localhost:8050/render.html?url=http://page.html&timeout=10&wait=0.5'
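
When the target URL has its own query string, it needs to be percent-encoded before it is embedded in the render.html URL. The sketch below builds such a URL with the standard library. The function name splash_render_url is hypothetical, and the Splash address assumes a local instance on port 8050, as in the command above.

```python
from urllib.parse import urlencode


def splash_render_url(target_url, splash_base="http://localhost:8050", **params):
    """Build a render.html URL with the target URL safely percent-encoded."""
    query = urlencode({"url": target_url, **params})
    return f"{splash_base}/render.html?{query}"


# The result can be passed, quoted, to `scrapy shell`.
shell_url = splash_render_url("http://yahoo.com", timeout=10, wait=0.5)
print(shell_url)
```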
    

    To access Splash in a browser, simply go to http://0.0.0.0:8050/ and enter the URL there. I'm not sure about the method in the tutorial, but this is how you can interact with the Splash session.
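
If the goal is still to view the rendered HTML from inside the spider, one workaround is to write response.body to a file with an .html extension and open that. This is a sketch, not something verified against the question's setup. It rests on the assumption that open_in_browser picks the temporary file's extension from the response type, and that the Splash response is treated as plain text and saved as .txt, which would explain Notepad. The function names are hypothetical.

```python
import tempfile
import webbrowser
from pathlib import Path


def save_as_html(body: bytes) -> Path:
    """Write raw page bytes to a temporary file that ends in .html."""
    with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as f:
        f.write(body)
    return Path(f.name)


def open_body_in_browser(body: bytes) -> bool:
    """Save the body as .html and ask the default browser to open it."""
    path = save_as_html(body)
    return webbrowser.open(path.as_uri())
```

In the spider's parse method this would be called as open_body_in_browser(response.body) in place of open_in_browser(response).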