I put together a crawler to extract the URLs found on a website and save them to a JSON Lines file:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    name = 'medscape_crawler'

    allowed_domains = ['medscape.com']
    start_urls = ['https://www.medscape.com/']

    custom_settings = {
        'ROBOTSTXT_OBEY': False,
        'DOWNLOAD_DELAY': 2,
        'FEEDS': {'medscape_links.jsonl': {'format': 'jsonlines'}},
        'FEED_EXPORT_BATCH_ITEM_COUNT': 10,
        'JOBDIR': 'crawl_state',
    }

    def parse(self, response):
        yield {'url': response.url}  # Save this page's URL

        for href in response.css('a::attr(href)').getall():
            if href.startswith('http://') or href.startswith('https://'):
                yield response.follow(href, self.parse)

process = CrawlerProcess()
process.crawl(MySpider)
process.start()
The crawler successfully collects the links but never writes any output to the JSON Lines file:
2023-05-28 21:20:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://portugues.medscape.com> (referer: https://www.medscape.com/)
2023-05-28 21:20:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://portugues.medscape.com>
{'url': 'https://portugues.medscape.com'}
The JSON Lines file stays empty. Adding 'FEED_EXPORT_BATCH_ITEM_COUNT': 10 does not trigger earlier writes either.
Any help would be greatly appreciated.
Thank you!
It does work; you may want to clean out the active.json file inside the crawl_state directory.
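For example, before re-running you could reset the saved state entirely (a blunt, untested sketch; it wipes the whole crawl_state directory rather than just active.json, so any pause/resume state is lost):

# Untested sketch: remove the JOBDIR used above to start from a clean slate.
import shutil

shutil.rmtree('crawl_state', ignore_errors=True)  # 'crawl_state' is the JOBDIR from the question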
If you want to save to separate files, use a batch placeholder such as %(batch_id)d in the feed URI (FEED_URI_PARAMS lets you define custom placeholders):
custom_settings = {
    'ROBOTSTXT_OBEY': False,
    'DOWNLOAD_DELAY': 2,
    'FEEDS': {'json_files/batch-%(batch_id)d.jsonl': {'format': 'jsonlines'}},
    'FEED_EXPORT_BATCH_ITEM_COUNT': 10,
    'JOBDIR': 'crawl_state',
}
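For custom placeholders beyond %(batch_id)d, FEED_URI_PARAMS points to a function that returns the parameters used to format the feed URI. A minimal sketch (the myproject.utils module path is only an assumption for illustration):

# settings.py / custom_settings:
# 'FEED_URI_PARAMS': 'myproject.utils.uri_params',

# myproject/utils.py (hypothetical module)
def uri_params(params, spider):
    # Expose %(spider_name)s for use in the feed URI, e.g.
    # 'json_files/%(spider_name)s-batch-%(batch_id)d.jsonl'
    return {**params, 'spider_name': spider.name}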
If you pause your job, then you may want to set overwrite to False (if you're not saving to different files; I haven't tested that, though).
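Something along these lines should do it (an untested sketch; overwrite is a per-feed option inside FEEDS):

'FEEDS': {
    'medscape_links.jsonl': {
        'format': 'jsonlines',
        'overwrite': False,  # append to the existing file on resume instead of truncating it
    },
},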