I have a fairly standard Scrapy project; its directory structure looks like this:
my_project/
    scrapy.cfg
    my_project/
        __init__.py
        items.py
        itemsloaders.py
        middlewares.py
        MyStatsCollector.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            spider1.py
            spider2.py
            spider3.py
Right now, my database connection is placed in my_project/pipelines.py:
import psycopg2

class SaveToPostgresPipeline:
    def __init__(self):
        hostname = ''
        username = ''
        password = ''
        database = ''
        # The connection is opened once, when Scrapy instantiates the pipeline
        self.connection = psycopg2.connect(
            host=hostname, user=username, password=password, dbname=database
        )
        self.cur = self.connection.cursor()
The spiders work like this: they scrape data and send it to the pipeline, which saves it to the database.
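For illustration, the saving step looks roughly like this (simplified; the table and column names are just placeholders, not my real schema):

    def process_item(self, item, spider):
        # Simplified example; "items" and "url" are placeholder names
        self.cur.execute(
            "INSERT INTO items (url) VALUES (%s)",
            (item.get('url'),)
        )
        self.connection.commit()
        return item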
I now also need to fetch some data from the database in the spiders (spider1.py, spider2.py, spider3.py) and in MyStatsCollector.py.
Where should I set up the database connection within the project so that, ideally, I initialize it just once and then use it in the spiders, the pipelines, and MyStatsCollector.py?
Right now, my only idea is to initialize the DB connection in each of these files, which doesn't look very elegant. What's the best way to handle this?
If you open the connection in the spider and assign it to a spider attribute, you will be able to access it in the spider (obviously) and in all components that get a spider instance, including pipelines and stats collectors.
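Here is a minimal sketch of that approach; the attribute name db, the connection details, and the table/column names are placeholders you would adapt to your project:

import psycopg2
import scrapy
from scrapy.statscollectors import StatsCollector


class Spider1(scrapy.Spider):
    name = "spider1"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Opened once per spider run; every component that receives this
        # spider instance (pipelines, stats collector, ...) can use spider.db
        self.db = psycopg2.connect(
            host='', user='', password='', dbname=''
        )

    def parse(self, response):
        # The spider itself can read from the database directly
        with self.db.cursor() as cur:
            cur.execute("SELECT url FROM seen_urls")  # placeholder query
            seen = {row[0] for row in cur.fetchall()}
        ...

    def closed(self, reason):
        # Called when the spider finishes; close the shared connection here
        self.db.close()


class SaveToPostgresPipeline:
    def process_item(self, item, spider):
        # Reuse the connection the spider opened instead of creating a new one
        with spider.db.cursor() as cur:
            cur.execute(
                "INSERT INTO items (url) VALUES (%s)",  # placeholder schema
                (item.get('url'),),
            )
        spider.db.commit()
        return item


class MyStatsCollector(StatsCollector):
    def open_spider(self, spider):
        super().open_spider(spider)
        # The stats collector also receives the spider instance
        with spider.db.cursor() as cur:
            cur.execute("SELECT count(*) FROM items")  # placeholder query
            self.set_value('items_in_db_at_start', cur.fetchone()[0])

Since process_item and the stats collector callbacks all receive the spider instance, there is no need for a separate connection in each file; the spider owns the connection and its lifetime (opened when the spider starts, closed when it finishes).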