I am working to create a a matrix of missingness for a SQL database consisting of 5 tables and nearly 10 years of data. I have established ODBC connectivity and am using the RODBC package in R as my working environment. I am trying to write a function that will output a count of rows for each year for each table, a count and percent of null values (values not present) in a given year for a given table, and a count and percent of missing (questions skipped/not answered) values for a given table. I have written the code below, trying to get it to work on one variable then turning it into a function once it works. However, when I run this code, the count for total, missing and null values are all the same and the percent of course is 1. I am not getting any error messages. I am not sure where the issue lies and it is important to distinguish between missing and null for this project. Any insight is much appreciated.
test1 <- sqlQuery(channel, "
SELECT [event_year] AS 'YEAR',
Count(*) AS 'TOTAL',
Count(CASE
WHEN mother_education_trendable = 'NA' THEN 1
ELSE 0
END) AS 'NULL_VAL',
Count(CASE
WHEN mother_education_trendable = -1 THEN 1
ELSE 0
END) AS 'MISS_VAL'
FROM [GA_CMH].[dbo].[births]
GROUP BY [event_year]
ORDER BY [event_year]
")
test1$nullpct<-with(test1, NULL_VAL/TOTAL)
test1$misspct<-with(test1, MISS_VAL/TOTAL)
Your current CASE
statement inside the Count
aggregate will populate either 1
or 0
both will be considered in Count
aggregate so you are getting same count as total.
Zero is a value that will be counted in Count
aggregate so remove the ELSE
part in CASE
statement by default NULL
will be populated non matching conditions which will not be counted/considered in COUNT
aggregate
SELECT [event_year] AS 'YEAR',
Count(*) AS 'TOTAL',
Count(CASE
WHEN mother_education_trendable = 'NA' THEN 1
END) AS 'NULL_VAL',
Count(CASE
WHEN mother_education_trendable = -1 THEN 1
END) AS 'MISS_VAL'
FROM [GA_CMH].[dbo].[births]
GROUP BY [event_year]
ORDER BY [event_year]
Or use SUM
aggregate instead of COUNT
SELECT [event_year] AS 'YEAR',
Count(*) AS 'TOTAL',
SUM(CASE
WHEN mother_education_trendable = 'NA' THEN 1 ELSE 0
END) AS 'NULL_VAL',
SUM(CASE
WHEN mother_education_trendable = -1 THEN 1 ELSE 0
END) AS 'MISS_VAL'
FROM [GA_CMH].[dbo].[births]
GROUP BY [event_year]
ORDER BY [event_year]