Search code examples
rsql-serverrodbc

SQL Server Query failing when executed from R


I have the following simple SQL server code:

set nocount on;
if OBJECT_ID('tempdb..#A') IS NOT NULL DROP TABLE #A;

set nocount on;
create table #A
( obj_id int,
  obj_name varchar(50),
  obj_dt   datetime);

  insert into #A (
  obj_id,
  obj_name,
  obj_dt)
  values
  ( 1
   ,'name'
   ,'2019-01-01 00:00:00'
  ),
  ( 2
   ,NULL
   ,NULL
  ),
  ( 2
   ,'alias'
   ,'2019-02-01 00:00:00'
  );

set nocount on;
if OBJECT_ID('tempdb..#B') IS NOT NULL DROP TABLE #B;

set nocount on;
select 
 #A.obj_id
,subq.obj_name
,subq.obj_dt
into #B 
from #A
join (select
        obj_id,
        max(obj_name) as obj_name,
        max(obj_dt) as obj_dt
      from #A
      group by obj_id) as subq
on #A.obj_id = subq.obj_id;

set nocount on;
select * from #B;

which, as expected, returns the following data when executed in Microsoft SQL Server Management Studio:

obj_id  obj_name    obj_dt
1       name        2019-01-01 00:00:00.000
2       alias       2019-02-01 00:00:00.000
2       alias       2019-02-01 00:00:00.000

So far so good. Now I wish to run this code from R and have that same output returned to R. I have the query above stored in the string query and my RODBC connection stored in the variable connection, and attempt to retrieve the data with

sqlQuery(connection,query)

The result is character(0). However, if I modify the query above by commenting out the subq.obj_name and subq.obj_dt fields in the definition of #B then the code successfully returns the expected dataset

obj_id
1      1
2      2
3      2

from R.

So what is going on here? Both sql queries are valid and run successfully in the Microsoft SQL server environment, but only one works when piped through R. I can't figure out what accounts for the failure of the RODBC code to handle the second query.


Solution

  • Okay, so I think I've figured out what is going wrong here. The subquery

    select
        obj_id,
        max(obj_name) as obj_name,
        max(obj_dt) as obj_dt
        from #A
    group by obj_id
    

    produces a hidden warning. You don't see the warning if you just run the code as is, but if you store the output in a temporary table then the warning message is produced:

    select
        obj_id,
        max(obj_name) as obj_name,
        max(obj_dt) as obj_dt
        into #C
        from #A
    group by obj_id
    

    Warning: Null value is eliminated by an aggregate or other SET operation.

    The warning is hidden when this is run as part of the subquery in the original SQL code in the question. I believe this message is somehow part of output that R is "seeing", and once R sees that output it terminates the query. But since no results have been returned yet the output in R is empty (i.e. character(0)).

    To solve this issue I coalesced the variables that I'm computing the max of to some minimal values (I'm not sure about what the minimal character is in sql server collation but '0' worked for my purposes). The idea is to remove NULL values before aggregation so no warning is generated. The final working SQL code is below:

    set nocount on;
    if OBJECT_ID('tempdb..#A') IS NOT NULL DROP TABLE #A;
    
    set nocount on;
    create table #A
    ( obj_id int,
      obj_name varchar(50),
      obj_dt   datetime);
    
      insert into #A (
      obj_id,
      obj_name,
      obj_dt)
      values
      ( 1
       ,'name'
       ,'2019-01-01 00:00:00'
      ),
      ( 2
       ,NULL
       ,NULL
      ),
      ( 2
       ,'alias'
       ,'2019-02-01 00:00:00'
      );
    
    set nocount on;
    if OBJECT_ID('tempdb..#B') IS NOT NULL DROP TABLE #B;
    
    set nocount on;
    select 
     #A.obj_id
    ,subq.obj_name
    ,subq.obj_dt
    into #B 
    from #A
    join
    (select
            obj_id,
            max(isnull(obj_name,'0')) as obj_name,
            max(isnull(obj_dt,cast(-1 as datetime))) as obj_dt
          from #A
          group by obj_id) as subq 
    on #A.obj_id = subq.obj_id;
    
    set nocount on;
    select * from #B; 
    

    I believe this behavior should be addressed in the RODBC package as it is quite likely to trip up others and can be a bit tricky to track down the root cause and troubleshoot.