Search code examples
sql-server-2005rrodbc

RODBC sqlSave() stopping insert query when PK violated


I have developed an online survey that stores my data in a Microsoft SQL 2005 database. I have written a set of outlier checks on my data in R. The general workflow for these scripts is:

  1. Read data from SQL database with sqlQuery()
  2. Perform outlier analysis
  3. Write offending respondents back to database in separate table using sqlSave()

The table I am writing back to has the structure:

CREATE TABLE outliers2(
    modelid int
    , password varchar(50)
    , reason varchar(50),
Constraint PK_outliers2 PRIMARY KEY(modelid, reason)
)
GO

As you can see, I've set the primary key to be modelid and reason. The same respondent may be an outlier for multiple checks, but I do not want to insert the same modelid and reason combo for any respondent.

Since we are still collecting data, I would like to be able to update these scripts on a daily / weekly basis as I develop the models I am estimating on the data. Here is the general form of the sqlSave() command I'm using:

sqlSave(db, db.insert, "outliers2", append = TRUE, fast = FALSE, rownames = FALSE)

where db is a valid ODBC Connection and db.insert has the form

> head(db.insert)
  modelid password          reason
1     873       abkd WRONG DIRECTION
2     875       ab9d WRONG DIRECTION
3     890       akdw WRONG DIRECTION
4     905       pqjd WRONG DIRECTION
5     941       ymne WRONG DIRECTION
6     944       okyt WRONG DIRECTION

sqlSave() chokes when it tries to insert a row that violates the primary key constraint and does not continue with the other records for the insert. I would have thought that setting fast = FALSE would have alleviated this problem, but it doesn't.

Any ideas on how to get around this problem? I could always drop the table at the beginning of the first script, but that seems pretty heavy handed and will undoubtedly lead to problems down the road.


Solution

  • In this case, everything is working as expected. You uploading everything as a batch and SQL Server is stopping the batch as soon as it finds an error. Unfortunately, I don't know of a graceful built-in solution. But, I think it is possible to build a system in the database to handle this more efficiently. I like doing data storage/management in databases rather than within R, so my solution is very database heavy. Others may offer you a solution that is more R oriented.

    First, create a simple table, without constraints, to hold your new rows and adjust your sqlSave statement accordingly. This is where R will upload the information to.

    CREATE TABLE tblTemp(
        modelid int
        , password varchar(50)
        , reason varchar(50)
        , duplicate int()
    )
    GO
    

    Your query to put information into this table should assume 'No' for the column 'duplicate'. I use a pattern where 1=Y & 5=N. You could also only mark those that are outliers but I tend to prefer to be explicit with my logic.

    You will also need a place to dump all rows which violate the PK in outliers2.

    CREATE TABLE tblDuplicates(
        modelid int
        , password varchar(50)
        , reason varchar(50)
    )
    GO
    

    OK. Now all you need to do is to create a trigger to move the new rows from tblTemp to outliers2. This trigger will move all duplicate rows to tblDuplicates for later handling, deletion, whatever.

    CREATE TRIGGER FindDups
    ON tblOutliersTemp
    AFTER INSERT
    AS 
    

    I'm not going to go through and write the entire trigger. I don't have a SQL Server 2005 to test it against and I would probably make a syntax error and I don't want to give you bad code, but here's what the trigger needs to do:

    1. Identify all rows in tblTemp that would violate the PK in outliers2. Where duplicates are found, change the duplicates to 1. This would be done with an UPDATE statement.
    2. Copy all rows where duplicate=1 to tblDuplicates. You would do this with an INSERT INTO tblDuplicates ......
    3. Now copy the non-duplicate rows to outliers2 with an INSERT INTO statement that looks almost exactly like the one used in step 2.
    4. DROP all rows from tblTemp, to clear it out for your next batch of updates. This step is important.

    The nice part about doing it this way is sqlSave() won't error out just because you have a violation of your PK and you can deal with the matches at a later time, like tomorrow. :-)