Search code examples
sql-servert-sqlquery-performance

Mapping columns without knowing what's in the table


I have an odd situation where I will have data coming from various sources (all flat files, none of which are in my control, and no matter how many times I ask for a standard format, I get different column headers and different column orders). We do not have the manpower to manually go through these files to determine which columns are important. Each flat file will have between two and six "identification" columns. However, some of the columns, individually, are not unique, but their combinations can form unique keys. All told, each flat file can have somewhere around one hundred columns.

So, initially, I planned to load the data into a temp table and ask the user to identify which columns contained which data. Once I know that, I can process the file without issue. I would have the two to six columns to identify matches with existing records and the additional data that I am supposed to gather (all identified by the user).

I was then asked to add in the ability for the system to "recommend" which data columns are which. For that, my plan was a count. I would count how many nonempty values each column has and then count how many of those nonempty values match each of the six possible columns. From there, I can take a simple ratio to determine the likelihood that the data contained is of that particular type. There would be some overvaluing of columns that are not unique, but in general, it is working nicely. The problem is that it is very slow.

I created a metadata table that I am calling UploadedTableColumn which contains every column header of the source file and which column it maps to in the database. Here is my stored procedure to update the counts:

CREATE PROCEDURE stored_Procedure 
    @FileLoadID INT
AS
BEGIN
    DECLARE @SqlCommand NVARCHAR(MAX)

    DECLARE the_cursor CURSOR FAST_FORWARD FOR 
    SELECT N'UPDATE UploadedTableColumn SET NumberNonemptyRows = (SELECT COUNT(*) FROM ' + DestinationTableName + N' WHERE ISNULL(' + DestinationColumnName + N','''') <> ''''),' + CHAR(13)
        + N'NumberID1Rows = (SELECT COUNT(*) FROM ' + DestinationTableName + N' WHERE ISNULL(' + DestinationColumnName + N','''') IN (SELECT ID1 FROM ID1Table) AND ISNULL(' + DestinationColumnName + N','''') <> ''''),' + CHAR(13)
        + N'NumberID2Rows = (SELECT COUNT(*) FROM ' + DestinationTableName + N' WHERE ISNULL(' + DestinationColumnName + N','''') IN (SELECT ID2 FROM ID2Table) AND ISNULL(' + DestinationColumnName + N','''') <> ''''),' + CHAR(13)
        + N'NumberIDDateRows = (SELECT COUNT(*) FROM ' + DestinationTableName + N' WHERE IIF(ISDATE(' + DestinationColumnName + N')=1,IIF(CAST(' + DestinationColumnName + N' AS DATE) IN (SELECT IDDate FROM IDDateTable),1,0),0) = 1 AND ISNULL(' + DestinationColumnName + N','''') <> ''''),' + CHAR(13)
        + N'NumberID4Rows = (SELECT COUNT(*) FROM ' + DestinationTableName + N' WHERE ISNULL(' + DestinationColumnName + N', '''') IN (SELECT ID4 FROM ID4Table) AND ISNULL(' + DestinationColumnName + N','''') <> ''''),' + CHAR(13)
        + N'NumberID5Rows = (SELECT COUNT(*) FROM ' + DestinationTableName + N' WHERE ISNULL(' + DestinationColumnName + N', '''') IN (SELECT ID5 FROM ID5Table) AND ISNULL(' + DestinationColumnName + N','''') <> ''''),' + CHAR(13)
        + N'NumberID6Rows = (SELECT COUNT(*) FROM ' + DestinationTableName + N' WHERE ISNULL(' + DestinationColumnName + N', '''') IN (SELECT ID6 FROM ID6Table) AND ISNULL(' + DestinationColumnName + N','''') <> '''')' + CHAR(13)
        + N'WHERE DestinationTableName = ''' + DestinationTableName + N''' AND DestinationColumnName = ''' + DestinationColumnName + N''' AND FileLoadID = ' + CAST(@FileLoadID AS NVARCHAR) + N';' + CHAR(13) As SqlCommand
    FROM UploadedTableColumn
    WHERE FileLoadID = @FileLoadID

    OPEN the_cursor

    FETCH NEXT FROM the_cursor
    INTO @SqlCommand

    WHILE @@FETCH_STATUS = 0
    BEGIN
        EXECUTE(@SqlCommand)

        FETCH NEXT FROM the_cursor 
        INTO @SqlCommand
    END

    CLOSE the_cursor
    DEALLOCATE the_cursor
END

Is there a faster approach?


Solution

  • One small change that might help.

    You say you're holding every column of the source table in UploadedTableColumn - you don't need to do that, your cursor is looping through a lot of unnecessary columns. You can eliminate a lot with a pre-emptive column name match.

    So get a combined list of all possible ID columns from your ID1Table, ID2Table, etc, and only pull into UploadedTableColumn the ones that actually match a column in DestinationTableName.

    On the basis that there are probably no more than 6 columns in your source data that have a matching ID column name, you're now only checking those rather than all 100+.

    Of course, this doesn't help you if you've got people sending data without headers and no agreed format.

    Pseudo code to get the desired columns:

    SELECT name
    FROM sys.columns
    WHERE [object_id] = OBJECT_ID('DestinationTableName')
    AND Name IN
    (
    SELECT ID1 AS IDColumn FROM ID1Table
    UNION ALL
    SELECT ID2 AS IDColumn FROM ID2Table
    ...
    )