python web-scraping scrapy web-crawler scrapy-pipeline

Where should I bind the DB/Redis connection in Scrapy?

Sorry to disturb you guys. This is a bad question; it seems what really confused me is how ItemPipeline works in Scrapy. I'll close it and start a new question.

Where should I bind the DB/Redis connection in Scrapy: the Spider or the Pipeline?

In the Scrapy documentation, the MongoDB connection is bound to the Pipeline. But it could also be bound to the Spider (which is what the scrapy-redis extension does). The latter solution has the benefit that the connection is accessible, through the spider, in more places besides the pipeline, such as middlewares.

So, which is the better way to do it?

I'm confused by the statement that pipelines run in parallel (this is what the docs say). Does it mean there are multiple instances of MyCustomPipeline?

Besides, is a connection pool for Redis/DB preferred?

I just lack the field experience to make the decision. Need your help. Thanks in advance.

As the docs say, ItemPipeline runs in parallel. How? Are there duplicate instances of the ItemPipeline running in threads? (I noticed FilesPipeline uses a deferred thread to save files to S3.) Or is there only one instance of each pipeline, running in the main event loop? If it's the latter case, a connection pool doesn't seem to help, because a Redis call blocks while it runs, so only one connection could be in use at a time.

Solution

Understanding the Scrapy architecture is more important here. Two components from the architecture diagram in the Scrapy documentation are relevant:

Spiders

Spiders are custom classes written by Scrapy users to parse responses and extract items (aka scraped items) from them or additional URLs (requests) to follow. Each spider is able to handle a specific domain (or group of domains).

Item Pipeline

The Item Pipeline is responsible for processing the items once they have been extracted (or scraped) by the spiders. Typical tasks include cleansing, validation and persistence (like storing the item in a database).

With that architecture in mind: spider classes are used to scrape the website, and item pipeline classes are used to process the scraped items.

There are 2 scenarios here:

When you get the URLs from a database

Here, in order to scrape websites, you need the URLs of those websites. If the URLs are stored in a database, it's better to bind the database connection object to the spider class so that the URLs can be fetched dynamically.
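As a rough sketch of this first scenario, the helper below reads start URLs from a database. It uses the standard library's sqlite3 as a stand-in for whatever store actually holds the URLs, and the `start_urls` table and the `iter_start_urls` name are made up for illustration. The Scrapy-specific part is left out so that the snippet runs on its own.

```python
import sqlite3


def iter_start_urls(conn):
    """Yield URLs stored in the (hypothetical) start_urls table."""
    for (url,) in conn.execute("SELECT url FROM start_urls ORDER BY id"):
        yield url


# Demo: an in-memory database stands in for the real URL store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE start_urls (id INTEGER PRIMARY KEY, url TEXT)")
conn.executemany(
    "INSERT INTO start_urls (url) VALUES (?)",
    [("https://example.com/a",), ("https://example.com/b",)],
)
urls = list(iter_start_urls(conn))
conn.close()
```

In a real spider, the connection would typically be opened by the spider itself, and its start method would wrap each URL returned by such a helper in a `scrapy.Request`.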

When you want to process the scraped items (store data, etc.)

Here, you bind the database connection object to the Item Pipeline so that it can store the scraped data directly in the database.

Binding the database connection to the spider class and binding it to the pipeline class are both correct, depending on the scenario.
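A minimal sketch of this second scenario could look like the pipeline below. It follows the same open_spider / close_spider / process_item shape as the MongoDB example in the Scrapy docs, but swaps in sqlite3 so it needs nothing beyond the standard library. The class name, the `SQLITE_PATH` setting, and the `items` table with `url` and `title` columns are assumptions made for illustration.

```python
import sqlite3


class SQLitePipeline:
    """Item pipeline that owns a single database connection for the crawl."""

    def __init__(self, db_path):
        self.db_path = db_path
        self.conn = None

    @classmethod
    def from_crawler(cls, crawler):
        # SQLITE_PATH is a hypothetical setting name, not a built-in one.
        return cls(crawler.settings.get("SQLITE_PATH", "items.db"))

    def open_spider(self, spider):
        # Called once when the spider starts: open the connection here.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items (url TEXT, title TEXT)"
        )

    def close_spider(self, spider):
        # Called once when the spider finishes: flush and release.
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Called for every scraped item; assumes 'url' and 'title' keys.
        self.conn.execute(
            "INSERT INTO items (url, title) VALUES (?, ?)",
            (item["url"], item["title"]),
        )
        return item
```

To use something like this, the class would be listed under `ITEM_PIPELINES` in the project settings.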

Question 2:

Is a connection pool for Redis/DB preferred?

Yes, a connection pool is generally preferred for any database.

The connection pool maintains a generally steady-state collection of valid/open connections, say 10. When the application needs to run a query or do an update, it “borrows” a connection from the pool by “opening” a connection. When it’s done, it “closes” the connection, which returns it to the pool for use by the next request. Since the connection was already open, the cost of establishing a new connection is avoided.
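The borrow/return cycle described above can be sketched with a small hand-rolled pool. This is only an illustration built on `queue.Queue` and sqlite3, and the `ConnectionPool` class here is hypothetical. In practice you would normally use the pool that ships with your driver, for example `redis.ConnectionPool` in redis-py.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Toy pool: opens `size` connections up front and lends them out."""

    def __init__(self, db_path, size=10):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            # check_same_thread=False allows handing a connection to another thread.
            self._idle.put(sqlite3.connect(db_path, check_same_thread=False))

    @contextmanager
    def connection(self, timeout=None):
        conn = self._idle.get(timeout=timeout)  # "open": borrow an idle connection
        try:
            yield conn
        finally:
            self._idle.put(conn)  # "close": hand it back to the pool

    def close(self):
        # Really close every idle connection when shutting down.
        while not self._idle.empty():
            self._idle.get_nowait().close()


pool = ConnectionPool(":memory:", size=2)
with pool.connection() as conn:
    value = conn.execute("SELECT 1").fetchone()[0]
    idle_while_borrowed = pool._idle.qsize()
idle_after_return = pool._idle.qsize()
pool.close()
```

Borrowing blocks when every connection is in use, which is how the pool caps the number of simultaneous connections.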

Source: https://qr.ae/pNs8jA