python database scrapy

Scrapy: Where to init the database connection so it is accessible in spiders, pipelines, and other classes


I have a fairly standard Scrapy project; its directory structure looks like this:

my_project
  scrapy.cfg
  my_project
    __init__.py
    items.py
    itemsloaders.py
    middlewares.py
    MyStatsCollector.py
    pipelines.py
    settings.py
    spiders
      __init__.py
      spider1.py
      spider2.py
      spider3.py

Right now, my database connection lives in my_project/pipelines.py:

import psycopg2
class SaveToPostgresPipeline:
    def __init__(self):
        hostname = ''
        username = ''
        password = ''
        database = ''

The spiders scrape data and send it to the pipeline, which saves it to the database.

I now also need to fetch some data from the database in the spiders (spider1.py, spider2.py, spider3.py) and in MyStatsCollector.py.

Where should I set up the database connection within the project so that, ideally, I init it just once and then use it in the spiders, the pipelines, and MyStatsCollector.py?

Right now, my only idea is to init the DB connection in each of these files, which doesn't look very elegant. What's the best way to handle this?


Solution

  • If you create the connection in the spider and assign it to a spider attribute, you can access it in the spider itself and in every component that receives the spider instance, including pipelines and stats collectors.
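
A minimal sketch of that idea follows. It is not taken from the question's project and makes several assumptions: sqlite3 stands in for psycopg2 so the example runs without a PostgreSQL server, plain classes stand in for the Scrapy base classes, and the names db, known_titles, and rows_at_start are made up for illustration. The method names process_item, open_spider, and closed mirror the hooks Scrapy calls on pipelines, stats collectors, and spiders.

```python
import sqlite3


class Spider1:
    """Plain-class stand-in for a scrapy.Spider subclass (assumption)."""

    name = "spider1"

    def __init__(self):
        # Open the connection once and keep it as a spider attribute.
        # sqlite3 is a stand-in here; the real project would call
        # psycopg2.connect(...) with its own credentials.
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE items (title TEXT)")

    def known_titles(self):
        # The spider reads from the database through its own attribute.
        rows = self.db.execute("SELECT title FROM items ORDER BY rowid")
        return [row[0] for row in rows]

    def closed(self, reason):
        # Scrapy calls closed(reason) on a spider when it finishes.
        self.db.close()


class SaveToPostgresPipeline:
    def process_item(self, item, spider):
        # Reuse the spider's connection instead of opening another one.
        spider.db.execute("INSERT INTO items (title) VALUES (?)", (item["title"],))
        spider.db.commit()
        return item


class MyStatsCollector:
    """Stand-in for a custom stats collector; only the spider hook is shown."""

    def __init__(self):
        self.stats = {}

    def open_spider(self, spider):
        # Same connection again, reached through the spider instance.
        (count,) = spider.db.execute("SELECT COUNT(*) FROM items").fetchone()
        self.stats["rows_at_start"] = count


# Hand-rolled driver that imitates the calls Scrapy would make.
spider = Spider1()
stats = MyStatsCollector()
pipeline = SaveToPostgresPipeline()

stats.open_spider(spider)
for title in ("first", "second"):
    pipeline.process_item({"title": title}, spider)
titles = spider.known_titles()
spider.closed("finished")
```

Closing the connection in the spider's closed hook is only one option. If another component still needs the connection while the spider is shutting down, the close would have to move to whichever hook runs last.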