Tags: c#, sql-server, entity-framework, bulkinsert, table-valued-parameters

What is the difference between inserting data using SQL insert statements and SqlBulkCopy?


I have a problem inserting a huge amount of data into SQL Server.

Previously I was using Entity Framework, but it was painfully slow for just 100K root-level records (which contain two distinct collections, each involving roughly 200K further records), i.e. roughly 500K-600K records in memory. I applied all the usual optimizations (e.g. AutoDetectChangesEnabled = false, and recreating and disposing the context after each batch).
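
For reference, this is roughly the batching pattern I used (MyDbContext, RootRecords and records are placeholders for my actual model):

    //using System.Linq;
    const int batchSize = 1000;
    foreach (var batch in records
        .Select((record, index) => new { record, index })
        .GroupBy(x => x.index / batchSize, x => x.record))
    {
        //recreate and dispose the context per batch so the change tracker stays small
        using (var context = new MyDbContext())
        {
            context.Configuration.AutoDetectChangesEnabled = false;
            context.RootRecords.AddRange(batch);
            context.SaveChanges();
        }
    }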

I rejected that solution and used BulkInsert instead, which is substantially faster and more efficient: it inserted the 100K records in a minute or so.

But the main problem is getting back the primary keys of the newly inserted records. My current thought is to write a stored procedure operating on a TVP (i.e. an in-memory data table holding all 100K root-level records), and inside it use OUTPUT INSERTED.Id so I can get all the primary keys back in the application.
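
Roughly what I have in mind is below; the type, procedure, and column names are just placeholders, and dataTable is the in-memory DataTable mentioned above:

    //using System.Data; using System.Data.SqlClient; using System.Collections.Generic;
    /* SQL side (placeholder names):
       CREATE TYPE dbo.PersonTVP AS TABLE (Name nvarchar(100), DOB datetime);
    
       CREATE PROCEDURE dbo.InsertPeople @people dbo.PersonTVP READONLY
       AS
       INSERT INTO dbo.People (Name, DOB)
       OUTPUT INSERTED.Id
       SELECT Name, DOB FROM @people;
    */
    var ids = new List<long>();
    using (var con = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand("dbo.InsertPeople", con) { CommandType = CommandType.StoredProcedure })
    {
        var tvp = cmd.Parameters.AddWithValue("@people", dataTable); //DataTable with Name and DOB columns
        tvp.SqlDbType = SqlDbType.Structured;
        tvp.TypeName = "dbo.PersonTVP";
        con.Open();
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                ids.Add(reader.GetInt64(0)); //assumes Id is bigint
            }
        }
    }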

So, how does this approach (i.e. a SQL insert query inside a stored procedure) compare with the SqlBulkCopy approach?

Is there any way to get all the primary keys back after a SqlBulkCopy operation? Or can anyone confirm that OUTPUT INSERTED.Id would return all the correct new keys to the application?

PS: I don't want to create any staging table during the process; that's just overhead.


Solution

  • Here's an example based on the discussion in the comments, expanding on the idea mentioned here: Possible to get PrimaryKey IDs back after a SQL BulkCopy?

    i.e.

    • Do a bulk upload from C# into a temp table in SQL Server.
    • Use SQL to copy the data from the temp table into the real table (at which point the IDs are generated), and return those IDs.

    I've not had a chance to test this, but hopefully this will help:

    //using System;
    //using System.Collections.Generic;
    //using System.Data;
    //using System.Data.SqlClient;
    
    public DataTable CreatePersonDataTable(IEnumerable<PersonDTO> people) 
    {
        //define the table
        var table = new DataTable("People");
        table.Columns.Add(new DataColumn("Name", typeof(string)));
        table.Columns.Add(new DataColumn("DOB", typeof(DateTime)));
        //populate it
        foreach (var person in people)
        {
            table.Rows.Add(person.Name, person.DOB);
        }
        return table;
    }
    
    readonly string ConnectionString; //set this in the constructor
    readonly int BulkUploadPeopleTimeoutSeconds = 600; //default; could override in constructor
    public IEnumerable<long> BulkUploadPeople(IEnumerable<PersonDTO> people) //you'd want to break this up a bit; for simplicity I've bunged everything into one big method
    {
        var data = CreatePersonDataTable(people);
        using (SqlConnection con = new SqlConnection(ConnectionString)) 
        {
            con.Open(); //keep the same connection open throughout, so the session-scoped temp table survives until we're done with it
            RunSqlNonQuery(con, "select top 0 Name, DOB into #People from People");
            BulkUpload(con, data, "#People");
            var results = TransferFromTempToReal(con, "#People", "People", "Name, DOB", "Id");
            RunSqlNonQuery(con, "drop table #People"); //not strictly required since the temp table is session scoped and will be removed when the connection closes; but best to keep things clean
            return results;
        }
    }
    private void RunSqlNonQuery(SqlConnection con, string sql)
    {
        using (SqlCommand command = con.CreateCommand())
        {
            command.CommandText = sql;
            command.ExecuteNonQuery();
        }
    }
    private void BulkUpload(SqlConnection con, DataTable data, string targetTable)
    {
        using (SqlBulkCopy bulkCopy = new SqlBulkCopy(con))
        {
            bulkCopy.BulkCopyTimeout = BulkUploadPeopleTimeoutSeconds; //use the configurable timeout rather than a magic number
            bulkCopy.DestinationTableName = targetTable;
            bulkCopy.WriteToServer(data);
        }
    }
    private IList<long> TransferFromTempToReal(SqlConnection con, string tempTable, string realTable, string columnNames, string idColumnName)
    {
        //build the list eagerly (no yield return) so the reader is fully consumed
        //before the temp table is dropped and the connection is closed
        var ids = new List<long>();
        using (SqlCommand command = con.CreateCommand())
        {
            command.CommandText = string.Format("insert into {0} ({1}) output inserted.{2} select {1} from {3}", realTable, columnNames, idColumnName, tempTable);
            using (SqlDataReader reader = command.ExecuteReader()) 
            {
                while (reader.Read()) 
                {
                    ids.Add(reader.GetInt64(0));
                }
            }
        }
        return ids;
    }
    
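    For completeness, calling the above would look something like this (PersonRepository is a hypothetical class wrapping the methods above):

    //using System.Linq;
    var repo = new PersonRepository(connectionString);
    List<long> newIds = repo.BulkUploadPeople(people).ToList();
    //note: SQL Server doesn't guarantee the order of rows returned by OUTPUT,
    //so map the ids back to your DTOs via a natural key if you need a per-record match
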

    Whilst in your question you've added that you don't want to use a staging table because it's an "overhead"... please try it. You may find that the small overhead of creating a staging table is outweighed by the performance gain of using this method.

    Obviously it's not going to be as fast as inserting and ignoring the returned IDs; but if returning the IDs is a requirement then, in the absence of other answers, this may be the best alternative.