Search code examples
pythonpostgresqlscrapymultiple-insert

Batch/Bulk SQL insert in Scrapy Pipelines [PostgreSQL]


I am using my own pipeline to store the scrapped items into a PostgreSQL Database, I made an expansion a few days ago and I store the data into a 3 Databases now. So, I want to make the pipeline which inserting the data to be called every 100 items or it take the items and insert them 100 by 100.

The reason I want to make it fast and less headache on the DB Servers.


Solution

  • The solution was not that different from Anandhakumar's answer I created a global list in the settings file with a setter and getter method for it

    # This buffer for the bluk insertion
    global products_buffer
    
    products_buffer = []
    
    # Append product to the list
    def add_to_products_buffer(product):
      global products_buffer
      products_buffer.append(product)
    
    # Get the length of the product
    def get_products_buffer_len():
      global products_buffer
      return len(products_buffer)
    
    # Get the products list
    def get_products_buffer():
      global products_buffer
      return products_buffer
    
    # Empty the list
    def empty_products_buffer():
      global products_buffer
      products_buffer[:] = []
    

    Then I imported it in the pipeline

    from project.settings import products_buffer,add_to_products_buffer,get_products_buffer_len,empty_products_buffer,get_products_buffer
    

    and I append the item to the list every time the pipeline is called and i check if the length of the list is 100 I loop on the list to prepare a many inserts quires but the most important magic is to commit them all in one line, Don't commit in the loop or you won't gain anything and it will take long time to insert them all.

    def process_item(self, item, spider):  
        # Adding the item to the list
        add_to_products_buffer(item)
        # Check if the length is 100
        if get_products_buffer_len() == 100:
            # Get The list to loop on it
            products_list  = get_products_buffer()
            for item in products_list:
                # The insert query
                self.cursor.execute('insert query')
            try:
                # Commit to DB the insertions quires 
                self.conn.commit()
                # Emty the list
                empty_products_buffer()
            except Exception, e:
                # Except the error
    

    Also you can use executemany if you don't want to loop.