I have developed an online survey that stores my data in a Microsoft SQL 2005 database. I have written a set of outlier checks on my data in R. The general workflow for these scripts is:
The table I am writing back to has the structure:
CREATE TABLE outliers2(
modelid int
, password varchar(50)
, reason varchar(50),
Constraint PK_outliers2 PRIMARY KEY(modelid, reason)
)
GO
As you can see, I've set the primary key to be modelid and reason. The same respondent may be an outlier for multiple checks, but I do not want to insert the same modelid and reason combo for any respondent.
Since we are still collecting data, I would like to be able to update these scripts on a daily / weekly basis as I develop the models I am estimating on the data. Here is the general form of the sqlSave()
command I'm using:
sqlSave(db, db.insert, "outliers2", append = TRUE, fast = FALSE, rownames = FALSE)
where db
is a valid ODBC Connection and db.insert
has the form
> head(db.insert)
modelid password reason
1 873 abkd WRONG DIRECTION
2 875 ab9d WRONG DIRECTION
3 890 akdw WRONG DIRECTION
4 905 pqjd WRONG DIRECTION
5 941 ymne WRONG DIRECTION
6 944 okyt WRONG DIRECTION
sqlSave()
chokes when it tries to insert a row that violates the primary key constraint and does not continue with the other records for the insert. I would have thought that setting fast = FALSE
would have alleviated this problem, but it doesn't.
Any ideas on how to get around this problem? I could always drop
the table at the beginning of the first script, but that seems pretty heavy handed and will undoubtedly lead to problems down the road.
In this case, everything is working as expected. You uploading everything as a batch and SQL Server is stopping the batch as soon as it finds an error. Unfortunately, I don't know of a graceful built-in solution. But, I think it is possible to build a system in the database to handle this more efficiently. I like doing data storage/management in databases rather than within R, so my solution is very database heavy. Others may offer you a solution that is more R oriented.
First, create a simple table, without constraints, to hold your new rows and adjust your sqlSave statement accordingly. This is where R will upload the information to.
CREATE TABLE tblTemp(
modelid int
, password varchar(50)
, reason varchar(50)
, duplicate int()
)
GO
Your query to put information into this table should assume 'No' for the column 'duplicate'. I use a pattern where 1=Y & 5=N. You could also only mark those that are outliers but I tend to prefer to be explicit with my logic.
You will also need a place to dump all rows which violate the PK in outliers2.
CREATE TABLE tblDuplicates(
modelid int
, password varchar(50)
, reason varchar(50)
)
GO
OK. Now all you need to do is to create a trigger to move the new rows from tblTemp to outliers2. This trigger will move all duplicate rows to tblDuplicates for later handling, deletion, whatever.
CREATE TRIGGER FindDups
ON tblOutliersTemp
AFTER INSERT
AS
I'm not going to go through and write the entire trigger. I don't have a SQL Server 2005 to test it against and I would probably make a syntax error and I don't want to give you bad code, but here's what the trigger needs to do:
The nice part about doing it this way is sqlSave() won't error out just because you have a violation of your PK and you can deal with the matches at a later time, like tomorrow. :-)