Search code examples
sqlrrodbc

Counting null and missing values using RODBC package


I am working to create a a matrix of missingness for a SQL database consisting of 5 tables and nearly 10 years of data. I have established ODBC connectivity and am using the RODBC package in R as my working environment. I am trying to write a function that will output a count of rows for each year for each table, a count and percent of null values (values not present) in a given year for a given table, and a count and percent of missing (questions skipped/not answered) values for a given table. I have written the code below, trying to get it to work on one variable then turning it into a function once it works. However, when I run this code, the count for total, missing and null values are all the same and the percent of course is 1. I am not getting any error messages. I am not sure where the issue lies and it is important to distinguish between missing and null for this project. Any insight is much appreciated.

test1 <- sqlQuery(channel, "

SELECT [event_year] AS 'YEAR', 
       Count(*)     AS 'TOTAL', 
       Count(CASE 
               WHEN mother_education_trendable = 'NA' THEN 1 
               ELSE 0 
             END)   AS 'NULL_VAL', 
       Count(CASE 
               WHEN mother_education_trendable = -1 THEN 1 
               ELSE 0 
             END)   AS 'MISS_VAL' 
FROM   [GA_CMH].[dbo].[births] 
GROUP  BY [event_year] 
ORDER  BY [event_year] 

")

test1$nullpct<-with(test1, NULL_VAL/TOTAL)

test1$misspct<-with(test1, MISS_VAL/TOTAL)


Solution

  • Your current CASE statement inside the Count aggregate will populate either 1 or 0 both will be considered in Count aggregate so you are getting same count as total.

    Zero is a value that will be counted in Count aggregate so remove the ELSE part in CASE statement by default NULL will be populated non matching conditions which will not be counted/considered in COUNT aggregate

    SELECT [event_year] AS 'YEAR', 
           Count(*)     AS 'TOTAL', 
           Count(CASE 
                   WHEN mother_education_trendable = 'NA' THEN 1 
                 END)   AS 'NULL_VAL', 
           Count(CASE 
                   WHEN mother_education_trendable = -1 THEN 1 
                 END)   AS 'MISS_VAL' 
    FROM   [GA_CMH].[dbo].[births] 
    GROUP  BY [event_year] 
    ORDER  BY [event_year] 
    

    Or use SUM aggregate instead of COUNT

    SELECT [event_year] AS 'YEAR', 
           Count(*)     AS 'TOTAL', 
           SUM(CASE 
                   WHEN mother_education_trendable = 'NA' THEN 1 ELSE 0
                 END)   AS 'NULL_VAL', 
           SUM(CASE 
                   WHEN mother_education_trendable = -1 THEN 1 ELSE 0
                 END)   AS 'MISS_VAL' 
    FROM   [GA_CMH].[dbo].[births] 
    GROUP  BY [event_year] 
    ORDER  BY [event_year]