Tags: python, html, scrapy, screen-scraping, parsel

Using parsel in a Scrapy project


I'm trying to use the parsel library to scrape elements from an HTML file in a Scrapy project. This is my spider code, for a spider named 123Spider:

import scrapy
import requests

class 123Spider(scrapy.Spider):
    name = "123Spider"
    start_url = [
        'file://URI'
    ]

    def parse(self, response):
        for commentSelector in response.css("div._li"):
            yield {
                'comment': commentSelector.css('#js_ajn > p').extract(),
            }

When I run scrapy crawl 123Spider -o output.json from the command line, it exports an empty JSON file. The terminal shows this output:

2018-01-03 14:44:20 [scrapy.core.engine] DEBUG: Crawled (400) <GET  https://raw.githubusercontent.com/robots.txt> (referer: None)
2018-01-03 14:44:20 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://raw.githubusercontent.com/xxx.html> (referer: None)
2018-01-03 14:44:20 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://raw.githubusercontent.com/xxx.html>: HTTP status code is not handled or not allowed

Questions:

  1. Why are 404 and 400 errors returned when crawling the .html file? Parsing worked fine both in a standalone parsel .py script (sketched below) and within scrapy shell. (The HTML file is over 10 MB.)
  2. How do I correctly nest parsel selectors within my 123Spider class?

I searched existing questions, but none match my scenario.
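
For reference, the standalone parsel script that works looks roughly like this (a minimal sketch, not my exact file; the filename is a placeholder):

from parsel import Selector

# Read the local HTML file and build a parsel Selector from its text.
with open('xxx.html', encoding='utf-8') as f:
    selector = Selector(text=f.read())

# The same CSS selectors as in the spider work here.
for comment in selector.css('div._li #js_ajn > p').extract():
    print(comment)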

Update: The aim is to parse a .html file that already exists in my spider project structure. However, when crawling file://URI, the terminal shows no pages crawled. There's no typo in my URI; I tested it with scrapy shell.

2018-01-04 14:40:14 [scrapy.core.engine] INFO: Spider opened
2018-01-04 14:40:14 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-04 14:40:14 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-01-04 14:40:14 [scrapy.core.engine] INFO: Closing spider (finished)
2018-01-04 14:40:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 1, 4, 21, 40, 14, 392659),
'log_count/DEBUG': 1,
'log_count/INFO': 7,
'memusage/max': 55623680,
'memusage/startup': 55619584,
'start_time': datetime.datetime(2018, 1, 4, 21, 40, 14, 374933)}
2018-01-04 14:40:14 [scrapy.core.engine] INFO: Spider closed (finished)

Solution

  • Scrapy already uses parsel selectors by default, so there is no need for you to even import it: response.xpath() and response.css() delegate to the underlying parsel selector.
    Knowing that, you can remove any lines that import Selector and create an instance of it manually, and call response.css() directly (see the corrected sketch below).

    The real problem seems to be the 404, which simply means the document you were trying to access wasn't found.
    My first guess would be a typo in your start_urls. Note that the attribute must be spelled start_urls (plural); your spider defines start_url, which Scrapy silently ignores, and that would also explain the "Crawled 0 pages" run in your update. If that's not the case, you'll need to share the actual URL you're trying to scrape.

    The 400 error is just Scrapy trying and failing to fetch the robots.txt file. You could disable the RobotsTxtMiddleware to stop this from happening (see the settings snippet below), but there is no real benefit: it will cause you no problems and can be ignored.
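
If you do want to silence the robots.txt request, the usual switch is a one-line change in your project's settings.py (this setting controls the RobotsTxtMiddleware):

# settings.py -- tell Scrapy not to fetch and obey robots.txt
ROBOTSTXT_OBEY = False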
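
Putting it together, a corrected spider could look like the sketch below. It is a minimal sketch, not a drop-in fix: the class is renamed to Spider123 because Python identifiers cannot start with a digit, the attribute is renamed to start_urls, and the file:// URI is a placeholder you must replace with the absolute path to your HTML file.

import scrapy

class Spider123(scrapy.Spider):
    # 'scrapy crawl 123Spider' matches this name, not the class name.
    name = "123Spider"
    # Must be start_urls (plural); file:// URIs need an absolute path.
    start_urls = [
        'file:///absolute/path/to/your.html',  # placeholder
    ]

    def parse(self, response):
        # response.css() already returns parsel-backed selectors.
        for commentSelector in response.css('div._li'):
            yield {
                'comment': commentSelector.css('#js_ajn > p').extract(),
            }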