
How to build your own middleware in Scrapy?


I'm just starting to learn Scrapy and I have a question. For my spider I need to take a list of URLs (start_urls) from a Google Sheets table, and I have this code:

import gspread
from oauth2client.service_account import ServiceAccountCredentials

scope = ['https://spreadsheets.google.com/feeds','https://www.googleapis.com/auth/drive']
creds = ServiceAccountCredentials.from_json_keyfile_name('token.json', scope)


client = gspread.authorize(creds)
sheet = client.open('Sheet_1')
sheet_instance = sheet.get_worksheet(0)
records_data = sheet_instance.col_values(col=2)

for link in records_data:
    print(link)
    # ...

How do I configure a middleware so that when the spider is launched (scrapy crawl my_spider), the links from this code are automatically substituted into start_urls? Perhaps I need to create a class in middlewares.py? I would be grateful for any help with examples. The rule needs to apply to all new spiders; generating the list from a file in start_requests (for example start_urls = [l.strip() for l in open('urls.txt').readlines()]) is not convenient...


Solution

  • Read the Scrapy documentation on spider middleware, in particular the process_start_requests hook, then wire it up like this:

    spider.py:

    import scrapy
    
    
    class ExampleSpider(scrapy.Spider):
        name = 'example'
    
        # enable the middleware for this spider only
        custom_settings = {
            'SPIDER_MIDDLEWARES': {
                'tempbuffer.middlewares.ExampleMiddleware': 543,
            }
        }
        
        def parse(self, response):
            print(response.url)
    

    middlewares.py:

    import scrapy


    class ExampleMiddleware(object):
        def process_start_requests(self, start_requests, spider):
            # change this to your needs:
            with open('urls.txt', 'r') as f:
                for url in f:
                    # strip the trailing newline before building the request
                    yield scrapy.Request(url=url.strip())
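
    To make this rule apply to every new spider in the project (as the question asks), enable the middleware once in the project's settings.py instead of in each spider's custom_settings; a minimal sketch, assuming the project is named tempbuffer as above:

    settings.py:

    # enables ExampleMiddleware for every spider in the project
    SPIDER_MIDDLEWARES = {
        'tempbuffer.middlewares.ExampleMiddleware': 543,
    }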
    
    

    urls.txt:

    https://example.com
    https://example1.com
    https://example2.org
    

    output:

    [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example2.org> (referer: None)
    [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com> (referer: None)
    [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example1.com> (referer: None)
    https://example2.org
    https://example.com
    https://example1.com
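
    To read the start URLs from the Google Sheets table instead of urls.txt, the same process_start_requests hook can wrap the gspread snippet from the question. A minimal sketch, assuming the token.json credentials file, the 'Sheet_1' spreadsheet, and the URL column from the question (the class name GoogleSheetMiddleware is made up for illustration):

    import scrapy
    import gspread
    from oauth2client.service_account import ServiceAccountCredentials


    class GoogleSheetMiddleware(object):
        def process_start_requests(self, start_requests, spider):
            # same gspread setup as in the question
            scope = ['https://spreadsheets.google.com/feeds',
                     'https://www.googleapis.com/auth/drive']
            creds = ServiceAccountCredentials.from_json_keyfile_name('token.json', scope)
            client = gspread.authorize(creds)
            sheet = client.open('Sheet_1')
            sheet_instance = sheet.get_worksheet(0)
            # column 2 of the first worksheet holds the URLs
            for url in sheet_instance.col_values(col=2):
                if url.strip():
                    yield scrapy.Request(url=url.strip())

    Register it under SPIDER_MIDDLEWARES the same way as ExampleMiddleware above, and every spider will take its start requests from the sheet.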