python postgresql scrapy multiple-insert

Batch/Bulk SQL insert in Scrapy Pipelines [PostgreSQL]

I am using my own pipeline to store the scrapped items into a PostgreSQL Database, I made an expansion a few days ago and I store the data into a 3 Databases now. So, I want to make the pipeline which inserting the data to be called every 100 items or it take the items and insert them 100 by 100.

The reason I want to make it fast and less headache on the DB Servers.

Solution

The solution was not that different from Anandhakumar's answer I created a global list in the settings file with a setter and getter method for it

# This buffer for the bluk insertion
global products_buffer

products_buffer = []

# Append product to the list
def add_to_products_buffer(product):
  global products_buffer
  products_buffer.append(product)

# Get the length of the product
def get_products_buffer_len():
  global products_buffer
  return len(products_buffer)

# Get the products list
def get_products_buffer():
  global products_buffer
  return products_buffer

# Empty the list
def empty_products_buffer():
  global products_buffer
  products_buffer[:] = []

Then I imported it in the pipeline

from project.settings import products_buffer,add_to_products_buffer,get_products_buffer_len,empty_products_buffer,get_products_buffer

and I append the item to the list every time the pipeline is called and i check if the length of the list is 100 I loop on the list to prepare a many inserts quires but the most important magic is to commit them all in one line, Don't commit in the loop or you won't gain anything and it will take long time to insert them all.

def process_item(self, item, spider):  
    # Adding the item to the list
    add_to_products_buffer(item)
    # Check if the length is 100
    if get_products_buffer_len() == 100:
        # Get The list to loop on it
        products_list  = get_products_buffer()
        for item in products_list:
            # The insert query
            self.cursor.execute('insert query')
        try:
            # Commit to DB the insertions quires 
            self.conn.commit()
            # Emty the list
            empty_products_buffer()
        except Exception, e:
            # Except the error

Also you can use executemany if you don't want to loop.