Search code examples
c#postgresqlbulkinsertpostgresql-performance

How to improve bulk inserts to Postgres DB


I currently have a service in C# that uses dapper to call a stored procedure that does 2 things: if the customer exists, it grabs the customer GUID and adds it to the CustomerInformations table; if the customer doesn't exist, then it inserts the customer and then returns the GUID and adds it to the CustomerInformations table.

Previously, inserts were taking around 1.75 million records an hour. Now it can barely get 200,000 records an hour. There's around 75 million records in my CustomerInformations table and I am looking to fixing the bottleneck.

For each Customer property, it's iteratively calling a stored procedure. Each stored procedure call could have 2 inserts into the database. First, adding the customer into the Customers table and then adding the property into the CustomerInformations table. I understand that this may not be the most ideal way to store the data but it's not something I can change.

C# service

foreach (var info request.Data)
{
    string sql = "add_one_by_customer";
    object parameters = new
    {
        p_customer_first_name = info.FirstName,
        p_customer_last_name = info.LastName,
        p_customer_property_name = info.PropertyName,
        p_customer_property_value = info.PropertyValue
    };

    try
    {
        await db.ExecuteAsync(sql, parameters, transaction: transaction, commandType: CommandType.StoredProcedure);
    }
    catch (Exception e)
    {
        throw new Exception($"Failed to insert");
    }
}

Postgres stored procedure:

CREATE OR REPLACE PROCEDURE add_one_by_customer(
    p_customer_first_name  VARCHAR,
    p_customer_last_name  VARCHAR,
    p_customer_property_name  VARCHAR,
    p_customer_property_value  VARCHAR,
    )
    LANGUAGE plpgsql
AS $procedure$
DECLARE p_customer_id uuid;
        p_current_item_value varchar;   
begin   
    SELECT INTO p_customer_id,
                customer_id
    FROM customers
    WHERE customer_first_name = p_customer_first_name AND
          customer_last_name = p_customer_last_name
    limit 1;
                           
    
    IF (p_customer_id IS NULL) THEN  
        begin               
            INSERT INTO customers(customer_first_name, customer_last_name)
            VALUES (p_customer_first_name, p_customer_last_name) RETURNING  customer_id into p_customer_id;
            EXCEPTION WHEN unique_violation THEN
            p_customer_id  = (SELECT custmomer_id 
                              FROM  customers
                              WHERE customer_first_name = p_customer_first_name AND
                                    customer_last_name = p_customer_last_name
        END;
    end if;    
   
    p_current_item_value := (select property_value
                             from customer_informations
                             where customer_id = p_customer_id AND
                                   customer_property_name = p_customer_property_name);
  

   
    if (p_current_item_value is NULL) THEN
        INSERT INTO customer_informations(customer_id, customer_property_name, customer_property_value)
        VALUES (p_customer_id, p_customer_property_name, p_customer_property_value);
    elseif (p_current_item_value is not null AND p_current_item_value != p_item_value) then
        UPDATE customer_informations 
        SET customer_property_value = p_current_item_value
        WHERE  customer_id = p_customer_id ;        
    end if;
end; $procedure$;

Currently my CustomerInformations table has a unique constraint on Customer_Id, Customer_property_name.

Things I've tried to enhance:

  • Parallelize in the service (which is why you see the unique violation exception line in the stored procedure) this does speed it up but not enough.
  • I am considering dropping the unique constraint and the index but I am not sure how easy it'll be to clean up duplicates (other individuals interact with the database)

Any tips or suggestions would be greatly appreciated.

Customer_information unique constraint:

CONSTRAINT ux_customer_informations UNIQUE (customer_id, customer_property_name)

Customers unique constraint:

CONSTRAINT ux_customers UNIQUE (customer_firstname, customer_lastname)

Solution

  • Your current procedure is hugely inefficient. See:

    Avoid a nested code block with error handling, that's very expensive. Can be done properly with the "SELECT or INSERT" technique I use below. See:

    The second part is an UPSERT in disguise. Also much cheaper now:

    CREATE OR REPLACE PROCEDURE dd_one_by_customer(
          p_customer_first_name      text
        , p_customer_last_name       text
        , p_customer_property_name   text
        , p_customer_property_value  text
          )
      LANGUAGE plpgsql AS
    $proc$
    DECLARE
       p_customer_id uuid;
       p_current_item_value text;
    BEGIN
       LOOP
          SELECT customer_id
          FROM   customers
          WHERE  customer_first_name = p_customer_first_name
          AND    customer_last_name = p_customer_last_name
          INTO   p_customer_id;
    
          EXIT WHEN FOUND;
          
          INSERT INTO customers
                 (  customer_first_name,   customer_last_name)
          VALUES (p_customer_first_name, p_customer_last_name)
          ON     CONFLICT (customer_first_name, customer_last_name) DO NOTHING
          RETURNING customer_id
          INTO   p_customer_id;
    
          EXIT WHEN FOUND;
       END LOOP;
    
       INSERT INTO customer_informations
              (  customer_id,   customer_property_name,   customer_property_value)
       VALUES (p_customer_id, p_customer_property_name, p_customer_property_value)
       ON     CONFLICT (customer_id, customer_property_name) DO UPDATE
       SET    customer_property_value = EXCLUDED.customer_property_value
       WHERE  customer_property_value IS DISTINCT FROM p_current_item_value;
    END
    $proc$;
    

    This requires a UNIQUE constraint on each of the two tables - exactly the ones you declared (ux_customer_informations & ux_customers).

    If neither customer_property_value nor p_current_item_value can be null, simplify the final WHERE clause to:

    ...
    WHERE  customer_property_value <> p_current_item_value;