Search code examples
sqlrrodbc

Trouble counting null and missing values (and differentiating between the two) using RODBC package


I am working to create a a matrix of missingness for a SQL database consisting of 5 tables and nearly 10 years of data. I have established ODBC connectivity and am using the RODBC package in R as my working environment. I am trying to write a function that will output a count of rows for each year for each table, a count and percent of null values (values not present) in a given year for a given table, and a count and percent of missing (questions skipped/not answered) values for a given table. I am working with the code below, trying to get it to work on one variable then turning it into a function once it works. However, when I run this code(see below), it appears to not be working, and I believe the issue lies with assigning an integer value to the character for null, NA. I am getting this message when trying to list vars in the function:

Error in as.environment(pos) : no item called "22018 245 [Microsoft][ODBC SQL Server Driver][SQL Server]Conversion failed when converting the varchar value 'NA' to data type int." on the search list.

Also, when I try to find the environment for the function, R returns NULL. I do not necessarily want to assign a new value to the already existent variable, and I new to SQL, but I am trying to do something along these lines If X = 'NA' then Y = 1 else 0. I get the following error message when I try to run the final 2 lines creating the percent vars:

Error in eval(substitute(expr), data, enclos = parent.frame()) : invalid 'envir' argument of type 'character'

Any insight?

test1 <- sqlQuery(channel, "select

              [EVENT_YEAR] AS 'YEAR', 
              COUNT(*) AS 'TOTAL',
              SUM(CASE WHEN MOTHER_EDUCATION_TRENDABLE = 'NA' THEN 1 ELSE 0 END) AS 'NULL_VAL',
              SUM(CASE WHEN MOTHER_EDUCATION_TRENDABLE = -1 THEN 1 ELSE 0 END) AS 'MISS_VAL'

              from [GA_CMH].[dbo].[BIRTHS]

              GROUP BY [EVENT_YEAR]
              ORDER BY [EVENT_YEAR]")

              test1$nullpct<-with(test1, NULL_VAL/TOTAL)
              test1$misspct<-with(test1, MISS_VAL/TOTAL)

Solution

  • I believe the data type of your column MOTHER_EDUCATION_TRENDABLE is an integer, if so, try:

    
        select
        [EVENT_YEAR] AS 'YEAR',
        COUNT(*) AS 'TOTAL',
        SUM(CASE WHEN MOTHER_EDUCATION_TRENDABLE IS NULL THEN 1 ELSE 0 END) AS 'NULL_VAL', 
        SUM(CASE WHEN MOTHER_EDUCATION_TRENDABLE = -1 THEN 1 ELSE 0 END) AS 'MISS_VAL'
        from [GA_CMH].[dbo].[BIRTHS]
        GROUP BY [EVENT_YEAR]
        ORDER BY [EVENT_YEAR]