Search code examples
sql-servert-sqloptimizationstored-procedures

How to optimize T-SQL stored procedure to search thru a source table and insert into a destination table for new records


I created the below stored procedure to pull data from a source table and insert only "new records" into a destination table. I need to run this every minute. The challenge is that the table stores "images" with over 2 million rows.

When I run the stored procedure it is still running over 1 hour and 22 minutes. I've already moved all images over to this table so there are no new images to pull over so I'm assuming it's still reading the table for new records. The job is not being blocked and is still runnable.

Is there any way to optimize this stored procedure so that it pulls only new records. My goal would be to optimize so that we can run this on a re-occuring schedule. The business requirements are to run this every minute, but based on the current outcome that is not possible.

I've included the source and target table schema so you can see all columns that I have to work with. I am not sure if I can use the created_date in the source table to filter the data so it doesn't look thru all the data. There is also an identity column (acc_image_id) in the source table. I'm not sure how I should modify the stored procedure to pull data more quickly based on the max(identity).

FYI, I can add columns to the target table if it makes sense to add a column to help filter this data further so that it only looks for recent images.

Source table:

CREATE TABLE [dbo].[acc_image]
(
    [acc_image_id] [int] IDENTITY(1,1) NOT FOR REPLICATION NOT NULL,
    [acc_id] [int] NULL,
    [image_type_id] [int] NOT NULL,
    [data_format] [varchar](10) NOT NULL,
    [label] [varchar](50) NOT NULL,
    [description] [varchar](255) NULL,
    [image_width] [smallint] NULL,
    [image_height] [smallint] NULL,
    [image_color_depth] [tinyint] NULL,
    [image_thumbnail] [image] NOT NULL,
    [data] [image] NOT NULL,
    [notes] [text] NULL,
    [acc_specimen_id] [int] NULL,
    [acc_slide_id] [int] NULL,
    [include_in_report] [char](1) NOT NULL,
    [include_in_internet] [char](1) NOT NULL,
    [created_date] [datetime] NOT NULL,
    [row_version] [timestamp] NOT NULL,
    [report_section_number] [smallint] NULL,
    [external_notes] [text] NULL,
    [source_filename] [varchar](255) NULL,
    [page_count] [int] NOT NULL,
    [include_as_attachment] [char](1) NOT NULL,
    [sort_order] [smallint] NULL,
    [acc_parent_image_id] [int] NULL,
    [created_by_id] [int] NULL,
    [annotation] [text] NULL,
    [specimen_results_enabled] [char](1) NOT NULL,
    [dis_imageserver_id] [int] NULL,
    [external_slide_image_id] [varchar](80) NULL,
    [external_report_image_id] [varchar](80) NULL,
)   

Here is the target table schema. I have an inserted_date that I don't know if I can use to optimize:

CREATE TABLE [dbo].[acc_image]
(
    [acc_image_id] [int] NOT NULL,
    [acc_id] [int] NULL,
    [image_type_id] [int] NOT NULL,
    [data_format] [varchar](10) NOT NULL,
    [label] [varchar](50) NOT NULL,
    [description] [varchar](255) NULL,
    [image_width] [int] NULL,
    [image_height] [int] NULL,
    [image_color_depth] [tinyint] NULL,
    [image_thumbnail] [image] NOT NULL,
    [data] [image] NOT NULL,
    [image_guid] [uniqueidentifier] NULL,
    [created_date] [datetime] NOT NULL,
    [row_version] [varbinary](12) NOT NULL,
    [sort_order] [int] NULL,
    [insert_date] [datetime] NOT NULL,
    [updated_date] [datetime] NULL,
) 

Here's the stored procedure code:

ALTER PROCEDURE [dbo].[get_image]
AS
BEGIN
    DECLARE @RecCt AS int = 0

    BEGIN TRY
        INSERT INTO connect_onprem.dbo.acc_image (acc_image_id, acc_id, image_type_id,    
                                                  data_format, label, description, 
                                                  image_width, image_height, 
                                                  image_color_depth, image_thumbnail, 
                                                  data, created_date, row_version, sort_order)
        SELECT 
            src.acc_image_id, a.id, src.image_type_id, 
            src.data_format, src.label, src.description, 
            src.image_width, src.image_height,
            src.image_color_depth, src.image_thumbnail,
            src.data, src.created_date, src.row_version, src.sort_order
        FROM 
            [ARKPPTEST\POWERPATHTEST].[Powerpath_Test].[dbo].accession_2 a 
        INNER JOIN 
            [ARKPPTEST\POWERPATHTEST].[Powerpath_Test].[dbo].acc_specimen s ON a.primary_specimen_id = s.id
        INNER JOIN 
            [ARKPPTEST\POWERPATHTEST].[Powerpath_Test].[dbo].acc_slide ass ON s.id = ass.acc_specimen_id
        --source table      
        INNER JOIN 
            [ARKPPTEST\POWERPATHTEST].[Powerpath_Test].[dbo].acc_image src ON ass.id = src.acc_slide_id
    
        --target table
        LEFT JOIN 
            connect_onprem.dbo.acc_image tgt ON src.acc_image_id = tgt.acc_image_id
        WHERE 
            tgt.acc_image_id IS NULL
            AND a.acc_type_id <> 134
            AND a.status_final = 'Y'
        ORDER BY 
            acc_image_id
    
    SET @RecCt = @@ROWCOUNT

    IF @RecCt > 0
    BEGIN
        INSERT INTO connect_onprem.dbo.ErrorLog (UserName, ErrorNumber, ErrorState, 
                        ErrorSeverity, ErrorLine,  ErrorProcedure, ErrorMsg, ErrorDateTime)
        Values ('RecTrack', @RecCt, 0, 0, 0, 'get_image', 
                            'connect_onprem.dbo.acc_image Inserted Records', GETDATE());
    END


END TRY
BEGIN CATCH
INSERT INTO  connect_onprem.dbo.ErrorLog (UserName, ErrorNumber, ErrorState, ErrorSeverity, 
                  ErrorLine, ErrorProcedure,  ErrorMsg, ErrorDateTime)
VALUES (SUSER_SNAME(), ERROR_NUMBER(), ERROR_STATE(), ERROR_SEVERITY(), ERROR_LINE(), 
            ERROR_PROCEDURE(), ERROR_MESSAGE(), GETDATE());
    END CATCH
END

Goal: optimize this to run as fast as possible; to read the source table and insert "new" records into the target table using a linked server running across the network.

Right now, the stored procedure is running over 1 hour and 22 minutes. I need to somehow change this to be able to run at the minimum every 1 minute (if possible).


Solution

  • I suspect that the remote part of the query is retrieving and transferring all selected data (including the presumably large data and image_thumbnail values) from the remote database before checking locally for the existence of a local copy.

    I would assume that once the initial load is complete, most of the retrieved data will be discarded as duplicate. For example, you might have one million records in your remote source, but only one thousand need to be inserted as new rows to your target table.

    The solution may be to initially just select ID values into a temp table (or table variable) and then use that in a second query to actually select and insert the final data. retrieving one million ID values to be checked up front should be immensely faster than retrieving one million complete rows. Later, after the IDs have been filtered, retrieving just those one thousand rows may complete relatively fast.

    Something like:

        DECLARE @Selected_Ids TABLE(acc_image_id int not null)
    
        -- Pre-select IDs
        INSERT INTO @Selected_Ids(acc_image_id)
        SELECT
            src.acc_image_id
        FROM [RemoteServer].[Powerpath_Test].[dbo].accession_2 a 
        INNER JOIN [RemoteServer].[Powerpath_Test].[dbo].acc_specimen s ON 
                                        a.primary_specimen_id = s.id
        INNER JOIN [RemoteServer].[Powerpath_Test].[dbo].acc_slide ass ON  
                                        s.id = ass.acc_specimen_id
               --source table       
              INNER JOIN [RemoteServer].[Powerpath_Test].[dbo].acc_image src  on   
                                        ass.id = src.acc_slide_id
        
              --target table
              LEFT JOIN connect_onprem.dbo.acc_image tgt  ON src.acc_image_id = tgt.acc_image_id
        WHERE tgt.acc_image_id is null 
        AND a.acc_type_id <> 134
        AND a.status_final = 'Y'
        -- order by acc_image_id
        
        -- Main select
        INSERT INTO connect_onprem.dbo.acc_image(acc_image_id, acc_id, image_type_id,    
    data_format, label, description, image_width, image_height, image_color_depth, image_thumbnail, 
    data, created_date, row_version, sort_order)
        SELECT 
        src.acc_image_id, 
        a.id, 
        src.image_type_id, 
        src.data_format, 
        src.label, 
        src.description, 
        src.image_width,
        src.image_height,
        src.image_color_depth,
        src.image_thumbnail,
        src.data,
        src.created_date,
        src.row_version,
        src.sort_order
    
        FROM [RemoteServer].[Powerpath_Test].[dbo].accession_2 a 
        INNER JOIN [RemoteServer].[Powerpath_Test].[dbo].acc_specimen s ON 
                                        a.primary_specimen_id = s.id
        INNER JOIN [RemoteServer].[Powerpath_Test].[dbo].acc_slide ass ON  
                                        s.id = ass.acc_specimen_id
        INNER JOIN [RemoteServer].[Powerpath_Test].[dbo].acc_image src  on   
                                        ass.id = src.acc_slide_id
        WHERE src.acc_image_id IN (SELECT sel.acc_image_id FROM @Selected_Ids sel)
        --order by src.acc_image_id
    

    It appears that you can eliminate several joins from the second query by saving the accession ID during the preselect. Since all of the filters have already been applied, and the rest of the data comes from the acc_image table, the other tables need not be referenced.

        DECLARE @Selected_Ids TABLE(acc_image_id int not null, acc_id int not null)
    
        -- Pre-select IDs
        INSERT INTO @Selected_Ids(acc_image_id, acc_id)
        SELECT
            src.acc_image_id,
            a.id
        FROM [RemoteServer].[Powerpath_Test].[dbo].accession_2 a 
        INNER JOIN [RemoteServer].[Powerpath_Test].[dbo].acc_specimen s ON 
                                        a.primary_specimen_id = s.id
        INNER JOIN [RemoteServer].[Powerpath_Test].[dbo].acc_slide ass ON  
                                        s.id = ass.acc_specimen_id
               --source table       
              INNER JOIN [RemoteServer].[Powerpath_Test].[dbo].acc_image src  on   
                                        ass.id = src.acc_slide_id
        
              --target table
              LEFT JOIN connect_onprem.dbo.acc_image tgt  ON src.acc_image_id = tgt.acc_image_id
        WHERE tgt.acc_image_id is null 
        AND a.acc_type_id <> 134
        AND a.status_final = 'Y'
        -- order by acc_image_id
        
        -- Main select
        INSERT INTO connect_onprem.dbo.acc_image(acc_image_id, acc_id, image_type_id,    
    data_format, label, description, image_width, image_height, image_color_depth, image_thumbnail, 
    data, created_date, row_version, sort_order)
        SELECT 
        src.acc_image_id, 
        sel.acc_id,
        src.image_type_id, 
        src.data_format, 
        src.label, 
        src.description, 
        src.image_width,
        src.image_height,
        src.image_color_depth,
        src.image_thumbnail,
        src.data,
        src.created_date,
        src.row_version,
        src.sort_order
    
        FROM @Selected_Ids sel
        INNER JOIN [RemoteServer].[Powerpath_Test].[dbo].acc_image src on   
                                        sel.acc_image_id = src.acc_image_id
        --order by src.acc_image_id
    

    The query engine might perform a loop join to retrieve one remote acc_image row at a time or it might send a list of all preselected ids to the remote server for a batch retrieval. You should run performance tests on both and examine the resulting execution plans.

    You might even try the following hybrid of the above in the second query:

        ...
        FROM [RemoteServer].[Powerpath_Test].[dbo].acc_image src
        INNER JOIN @Selected_Ids sel on
                                        src.acc_image_id = sel.acc_image_id
        WHERE src.acc_image_id IN (SELECT sel1.acc_image_id FROM @Selected_Ids sel1)
    

    (Note that I commented out the order by, as I believe it to be unnecessary.)