Tags: heroku, command-line, scrapy, settings

How to run ScrapyRT over Heroku with custom settings?


I have a Scrapy project, and on top of it I use ScrapyRT to create an API. First, I deployed the application to Heroku with the default settings and with the following Procfile:

web: scrapyrt -i 0.0.0.0 -p $PORT

Everything is fine so far; it runs as expected.

The Scrapy project has a pipeline that sends the scraped items to a MongoDB database. That works fine as well.

Now, since I am already saving the scraped data into a database, my intention was to create an additional resource to handle GET requests, so that ScrapyRT checks in the database whether the item was scraped before and returns it instead of running the spider. According to the documentation for ScrapyRT, in order to add a new resource I needed to pass custom settings through the command line (PowerShell on Windows) like this:

scrapyrt -S nist_scraper.scrapyrt.settings

where nist_scraper is the name of the project, scrapyrt is a subdirectory inside the project, and settings is the name of the Python file where the settings are located.

# nist_scraper/scrapyrt/settings.py

RESOURCES = {
    'crawl.json': 'nist_scraper.scrapyrt.resources.CheckDatabaseBeforeCrawlResource',
}
# nist_scraper/scrapyrt/resources.py
# custom resource

import json
import os

from dotenv import load_dotenv
from pymongo import MongoClient
from scrapyrt.resources import CrawlResource

load_dotenv()


class CheckDatabaseBeforeCrawlResource(CrawlResource):

    def render_GET(self, request, **kwargs):
        # Decode the URL query parameters
        api_params = {
            name.decode("utf-8"): value[0].decode("utf-8")
            for name, value in request.args.items()
        }

        try:
            cas = json.loads(api_params["crawl_args"])["cas"]
            collection_name = "substances"
            client = MongoClient(os.environ.get("MONGO_URI"))
            db = client[os.environ.get("MONGO_DB")]
        except Exception:
            # Missing or malformed crawl_args: fall back to crawling
            return super(CheckDatabaseBeforeCrawlResource, self).render_GET(
                request, **kwargs)

        substance = db[collection_name].find_one({"cas": cas}, {"_id": 0})
        if substance:
            # The item was scraped before: return it straight from the
            # database (the crawl metadata is deliberately omitted)
            return {
                "status": "ok",
                "items": [substance],
            }

        return super(CheckDatabaseBeforeCrawlResource, self).render_GET(
            request, **kwargs)

Again, locally, once I sent the GET request

{{BASE_URL}}crawl.json?spider_name=webbook_nist&start_requests=true&crawl_args={"cas":"74828"}

I get the desired behavior: the resource returns the item from the database instead of running the spider in the Scrapy project. I know the item came from the database because I modified the response returned by ScrapyRT and removed all the metadata.
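A quick way to check which path served a response is to look for ScrapyRT's crawl metadata. The sketch below is a hypothetical helper, not part of the project; it assumes ScrapyRT's default response format, which attaches a "stats" block when the spider actually ran, whereas the custom resource above returns only "status" and "items":

```python
def came_from_database(response_json):
    # The custom resource returns only "status" and "items", while
    # ScrapyRT's default CrawlResource also attaches crawl metadata
    # such as "stats". Its absence suggests a database hit.
    return response_json.get("status") == "ok" and "stats" not in response_json
```

For example, `came_from_database({"status": "ok", "items": [...]})` is true, while a response that includes a `"stats"` key came from an actual crawl.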

However, here is the issue. I deployed the same local project to Heroku, overriding the original one mentioned at the beginning (which worked fine), and changed the Procfile to:

web: scrapyrt -S nist_scraper.scrapyrt.settings -i 0.0.0.0 -p $PORT

But when I send the same GET request, ScrapyRT calls the spider and does not check whether the item is in the database. To be clear, the database is the same, and the item is indeed recorded in it. The response contains the metadata I removed in the custom resource.

I am not proficient with either Heroku or ScrapyRT, but I am assuming the issue is that Heroku is not picking up my custom settings when starting the API, so the ScrapyRT module runs its defaults, which always scrape the website using the spider.

The project is live here: https://nist-scrapyrt.herokuapp.com/crawl.json?spider_name=webbook_nist&start_requests=true&crawl_args={%22cas%22:%227732185%22}

And there is a GitHub repo here: https://github.com/oscarcontrerasnavas/nist-webbook-scrapyrt-spider

As far as I know, if I do not pass the custom settings through the command-line arguments, the default settings from scrapy.cfg are overridden by ScrapyRT's defaults.

I want the same behavior on Heroku as in the local environment. I do not want to run the spider every time, because pulling the info from the database is less "expensive".

Any suggestion?


Solution

  • The implementation shown in this question is correct; the problem was a typo in the environment variables on Heroku. If you have questions about how to do it yourself, you can leave a comment.
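One way to catch this class of mistake early is to validate the required config vars at startup instead of letting the bare fallback silently run the spider. This is a minimal, hypothetical sketch (the variable names come from the resource above; the helper itself is not part of the project):

```python
import os

# Environment variables the custom resource depends on
REQUIRED_VARS = ("MONGO_URI", "MONGO_DB")


def missing_env_vars(environ=os.environ):
    # Return the names of required variables that are unset or empty,
    # so a typo in Heroku's config vars fails loudly at boot rather
    # than silently falling back to running the spider on every request.
    return [name for name in REQUIRED_VARS if not environ.get(name)]
```

On Heroku, the actual values can then be compared against what the code expects by listing the app's config vars from the dashboard or the CLI.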