python, web-scraping, scrapy, web-crawler

Scrapy not saving output to jsonline


I put together a crawler to extract URLs found on a website and save them to a jsonline file:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    name = 'medscape_crawler'
    allowed_domains = ['medscape.com']
    start_urls = ['https://www.medscape.com/']
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
        'DOWNLOAD_DELAY': 2,
        'FEEDS': {'medscape_links.jsonl': {'format': 'jsonlines',}},
        'FEED_EXPORT_BATCH_ITEM_COUNT': 10,
        'JOBDIR': 'crawl_state',
    }

    def parse(self, response):
        yield {'url': response.url}  # Save this page's URL

        for href in response.css('a::attr(href)').getall():
            if href.startswith('http://') or href.startswith('https://'):
                yield response.follow(href, self.parse)

process = CrawlerProcess()
process.crawl(MySpider)
process.start()

The crawler successfully collects the links but does not populate the jsonline file with any outputs:

2023-05-28 21:20:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://portugues.medscape.com> (referer: https://www.medscape.com/)
2023-05-28 21:20:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://portugues.medscape.com>
{'url': 'https://portugues.medscape.com'}

The jsonline file remains empty, and adding 'FEED_EXPORT_BATCH_ITEM_COUNT': 10 does not trigger earlier writes either.

Any help would be greatly appreciated.

Thank you!


Solution

  • It does work; most likely the crawl is resuming from the saved job state, so you may want to clear the active.json file inside the crawl_state directory (the JOBDIR) before re-running (see the clean-up sketch after this answer).
    If you want to save to separate files, include a %(batch_id)d placeholder in the feed URI; FEED_EXPORT_BATCH_ITEM_COUNT then splits the output into numbered batch files (custom URI parameters can also be added via FEED_URI_PARAMS).

    custom_settings = {
        'ROBOTSTXT_OBEY': False,
        'DOWNLOAD_DELAY': 2,
        # A placeholder such as %(batch_id)d is needed in the feed URI when
        # FEED_EXPORT_BATCH_ITEM_COUNT is set; a new numbered file starts every 10 items.
        'FEEDS': {'json_files/batch-%(batch_id)d.jsonl': {'format': 'jsonlines'}},
        'FEED_EXPORT_BATCH_ITEM_COUNT': 10,
        'JOBDIR': 'crawl_state',
    }
    

    If you pause and resume the job (and aren't splitting the output into separate files), you may also want to set the feed's overwrite option to False so a resumed run appends to the file instead of replacing it; I haven't tested this, though. A sketch follows below.
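
    A minimal clean-up sketch, assuming the JOBDIR is the crawl_state directory from the question; deleting it makes the next run start from scratch instead of resuming and skipping already-seen requests:

    import shutil
    from pathlib import Path

    # Remove the persisted job state (requests.queue/active.json, requests.seen, ...)
    # so the spider re-crawls everything and the feed is populated again.
    job_dir = Path('crawl_state')
    if job_dir.exists():
        shutil.rmtree(job_dir)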
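
    And a sketch of the overwrite option mentioned above (untested, and assuming a single output file rather than batch files): with 'overwrite': False the local jsonlines feed appends to medscape_links.jsonl on later runs instead of overwriting it.

    custom_settings = {
        'ROBOTSTXT_OBEY': False,
        'DOWNLOAD_DELAY': 2,
        # Keep appending to the same file across paused/resumed runs.
        'FEEDS': {'medscape_links.jsonl': {'format': 'jsonlines', 'overwrite': False}},
        'JOBDIR': 'crawl_state',
    }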