sql, database, database-design, data-migration

Data Import Question: Should I use a cursor?


I'm currently working on a SQL import routine to import data from a legacy application into a more modern, robust system. The routine simply imports data from a flat-file legacy table (stored as a .csv file) into a SQL Server schema that follows the classic order/order-detail pattern. Here's what both tables look like:

**LEGACY_TABLE**  
Cust_No  
Item_1_No  
Item_1_Qty  
Item_1_Prc  
Item_2_No  
Item_2_Qty  
Item_2_Prc  
...  
Item_7_No    
Item_7_Qty  
Item_7_Prc  

As you can see, the legacy table is basically a 22-column spreadsheet representing a customer along with up to seven items, each with its quantity and purchase price.

The new table(s) look like this:

**INVOICE**  
Invoice_No  
Cust_No

**INVOICE_LINE_ITEM**  
Invoice_No  
Item_No  
Item_Qty  
Item_Prc  
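
For reference, the new tables are defined along these lines (the data types here are just placeholders; the point is the parent/child relationship between an invoice and its line items):

CREATE TABLE INVOICE (
    Invoice_No INT NOT NULL PRIMARY KEY,
    Cust_No    INT NOT NULL
)

CREATE TABLE INVOICE_LINE_ITEM (
    Invoice_No INT   NOT NULL REFERENCES INVOICE (Invoice_No),
    Item_No    INT   NOT NULL,
    Item_Qty   INT   NOT NULL,
    Item_Prc   MONEY NOT NULL,
    PRIMARY KEY (Invoice_No, Item_No)
)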

My quick-and-dirty approach has been to create a replica of the LEGACY_TABLE (let's call it LEGACY_TABLE_SQL) in SQL Server. This table will be populated from the .csv file using a database import that is already built into the application.
From there, I created a stored procedure to copy each of the values in the LEGACY_TABLE_SQL table to the INVOICE/INVOICE_LINE_ITEM tables and to handle the underlying logical constraints (performing existence tests, checking for already-open invoices, and so on). Finally, I've created a database trigger that calls the stored procedure when new data is inserted into the LEGACY_TABLE_SQL table.
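
For completeness, LEGACY_TABLE_SQL is just a column-for-column copy of the legacy layout, something like the sketch below (the data types and file path are only illustrative, since the application's built-in import does the actual load):

CREATE TABLE LEGACY_TABLE_SQL (
    Cust_No    INT,
    Item_1_No  INT,
    Item_1_Qty INT,
    Item_1_Prc MONEY,
    -- ... Item_2 through Item_7 repeat the same three columns ...
    Item_7_No  INT,
    Item_7_Qty INT,
    Item_7_Prc MONEY
)

-- the application performs the load; a manual equivalent would be something like:
BULK INSERT LEGACY_TABLE_SQL
FROM 'C:\import\legacy_orders.csv'   -- hypothetical path
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2)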

The stored procedure looks something like this:

CREATE PROC IMPORT_PROCEDURE
    @CUST_NO  INT,      -- data types simplified here
    @ITEM_NO  INT,
    @ITEM_QTY INT,
    @ITEM_PRC MONEY
AS
    -- existence checks, open-invoice checks, then inserts into
    -- INVOICE and INVOICE_LINE_ITEM

However, instead of calling the procedure once, I actually call the stored procedure seven times (once for each item) using a database trigger. I only execute the stored procedure when the ITEM_NO is NOT NULL, to account for blank items in the .csv file. Therefore, my trigger looks like this:

CREATE TRIGGER IMPORT_TRIGGER ON LEGACY_TABLE_SQL
AFTER INSERT
AS
BEGIN
    DECLARE @CUST_NO INT, @ITEM_NO_1 INT, @ITEM_QTY_1 INT, @ITEM_PRC_1 MONEY
    SELECT @CUST_NO = Cust_No, @ITEM_NO_1 = Item_1_No, @ITEM_QTY_1 = Item_1_Qty, @ITEM_PRC_1 = Item_1_Prc FROM inserted
    IF @ITEM_NO_1 IS NOT NULL
        EXEC IMPORT_PROCEDURE @CUST_NO, @ITEM_NO_1, @ITEM_QTY_1, @ITEM_PRC_1
END

...so on and so forth.

I'm not sure that this is the most efficient way to accomplish this task. Does anyone have any tips or insight that they wouldn't mind sharing?


Solution

  • I would separate the import process from any triggers. A trigger is useful if you're going to have rows being constantly added to the import table from a constantly running, outside source. It doesn't sound like this is your situation though, since you'll be importing an entire file at once. Triggers tend to hide code and can be difficult to work with in some situations.

    How often are you importing these files?

    I would have an import process that is mostly stand-alone. It might use stored procedures or tables in the database, but I wouldn't use triggers. A simple approach would be something like below. I've added a couple of tracking columns to Legacy_Invoices (also renamed to something more descriptive) so that you can track when items have been imported and how far along each row is. You can expand this to track more information (the source file, for example) if necessary.

    Also, I don't see how you're tracking invoice numbers in your code. I've assumed an IDENTITY column in the Legacy_Invoices. This is almost certainly insufficient since I assume that you're creating invoices in your own system as well (outside of the legacy system). Without knowing your invoice numbering scheme though, it's impossible to give a solution there.
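
    For the sake of the example, here is roughly the staging-table shape that the script below assumes (column types are illustrative; the important pieces are the IDENTITY invoice_no and the two tracking columns):

    CREATE TABLE Legacy_Invoices (
        invoice_no      INT IDENTITY(1,1) PRIMARY KEY,   -- assumed; adjust to your real numbering scheme
        cust_no         INT NOT NULL,
        item_no_1       INT NULL,
        item_qty_1      INT NULL,
        item_prc_1      MONEY NULL,
        -- ... item 2 through item 7 columns follow the same pattern ...
        import_status   VARCHAR(50) NOT NULL DEFAULT 'Awaiting Import',
        import_datetime DATETIME NULL
    )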

    BEGIN TRAN
    
    DECLARE
        @now DATETIME = GETDATE()
    
    UPDATE Legacy_Invoices
    SET
        import_datetime = @now
    WHERE
        import_status = 'Awaiting Import'
    
    INSERT INTO dbo.Invoices (invoice_no, cust_no)
    SELECT DISTINCT invoice_no, cust_no
    FROM
        Legacy_Invoices
    WHERE
        import_datetime = @now
    
    UPDATE Legacy_Invoices
    SET
        import_status = 'Invoice Imported'
    WHERE
        import_datetime = @now
    
    INSERT INTO dbo.Invoice_Lines (invoice_no, item_no, item_qty, item_prc)
    SELECT
        invoice_no,
        item_no_1,
        item_qty_1,
        item_prc_1
    FROM
        Legacy_Invoices LI
    WHERE
        import_datetime = @now AND
        import_status = 'Invoice Imported' AND
        item_no_1 IS NOT NULL
    
    UPDATE Legacy_Invoices
    SET
        import_status = 'Item 1 Imported'
    WHERE
        import_datetime = @now AND
        import_status = 'Invoice Imported'
    
    -- repeat the INSERT / UPDATE pair above for item_no_2 through item_no_7
    
    COMMIT TRAN
    

    Here's a big caveat, though. Cursors are normally undesirable in SQL, and you generally want set-based processing rather than RBAR (row by agonizing row) processing, but data imports are often an exception.

    The problem with the above is that if one row fails, that whole import step fails. Also, it's very difficult to run a single entity (invoice plus line items) through business logic when you're importing them in bulk. This is one place where SSIS really shines. It's extremely fast (assuming that you set it up properly), even when importing one entity at a time. You can then put all sorts of error handling in it to make sure that the import runs smoothly. One import row has an erroneous invoice number? No problem, mark it as an error and move on. A row has item #2 filled in, but no item #1, or has a price without a quantity? No problem, mark the error and move on.
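
    If you do stay in T-SQL and process row by agonizing row, the shape would be roughly the sketch below (Import_Single_Invoice is a placeholder for whatever per-invoice logic you need, and the status values are made up):

    DECLARE @invoice_no INT

    DECLARE invoice_cursor CURSOR LOCAL FAST_FORWARD FOR
        SELECT invoice_no
        FROM Legacy_Invoices
        WHERE import_status = 'Awaiting Import'

    OPEN invoice_cursor
    FETCH NEXT FROM invoice_cursor INTO @invoice_no

    WHILE @@FETCH_STATUS = 0
    BEGIN
        BEGIN TRY
            BEGIN TRAN

            -- import this one invoice and its line items, applying
            -- whatever business rules you need along the way
            EXEC Import_Single_Invoice @invoice_no   -- hypothetical procedure

            UPDATE Legacy_Invoices
            SET import_status = 'Imported'
            WHERE invoice_no = @invoice_no

            COMMIT TRAN
        END TRY
        BEGIN CATCH
            IF @@TRANCOUNT > 0 ROLLBACK TRAN

            -- mark the failure and move on to the next invoice;
            -- log ERROR_MESSAGE() somewhere if you need the details
            UPDATE Legacy_Invoices
            SET import_status = 'Import Error'
            WHERE invoice_no = @invoice_no
        END CATCH

        FETCH NEXT FROM invoice_cursor INTO @invoice_no
    END

    CLOSE invoice_cursor
    DEALLOCATE invoice_cursor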

    For a single import I might stick with the code above (adding in appropriate error handling of course), but for a repeating process I would almost certainly use SSIS. You can import millions of rows in seconds or minutes even with individual error handling on each business entity.
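
    For that single-shot version, the error handling can be as simple as wrapping the batch above so that any failure rolls the whole import back (again just a sketch; THROW needs SQL Server 2012 or later, use RAISERROR on older versions):

    BEGIN TRY
        BEGIN TRAN

        -- the UPDATE / INSERT statements from above go here

        COMMIT TRAN
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK TRAN;

        -- re-raise so the caller knows the import needs to be re-run
        THROW
    END CATCH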

    If you have any problems getting SSIS running (there are tutorials all over the web and on Microsoft's MSDN site), post them here and you should get quick answers.