I have a Microsoft SQL Server database which I have connected to with an ODBC link. There are three columns of interest in the table I am querying: serovar
, reportyear
and hospitalised
.
Each row represents one case. I want to collect an aggregated summary table into R where each row is a serovar, each column is a reportyear and each cell value represents the percentage hospitalised.
For example, if there were a total of 3 cases of serovar x in 2020, and only 2 were hospitalised, I would expect the percentage value in the cell to be 66.67%.
Here is some example data:
# Create example data:
dt <- data.frame(
pid = c(1,2,3,4,5,6,7,8),
serovar = c("x", "x", "x", "x", "y", "y", "y", "y"),
reportyear = c(2020, 2020, 2020, 2021, 2020, 2020, 2021, 2021),
hospitalised = c("Y", "Y", "N", "N", NA, "Y", NA, "N"))
# Output:
> dt
pid serovar reportyear hospitalised
1 1 x 2020 Y
2 2 x 2020 Y
3 3 x 2020 N
4 4 x 2021 N
5 5 y 2020 <NA>
6 6 y 2020 Y
7 7 y 2021 <NA>
8 8 y 2021 N
My lazy connection to the database is called db
:
# Create lazy connection to database:
db <- tbl(con,
category = params$category,
schema = "clean",
table = viewname)
Then I select the relevant columns, filter the data and calculate the percentage hospitalised:
summarytab <- db %>%
# Select required columns:
select(serovar, reportyear, hospitalised) %>%
# Re-code hospital:
mutate(hospitalised_tf = case_when(
hospitalised == "Y" ~ TRUE,
hospitalised == "N" ~ FALSE,
.default = NA
)) %>%
# Filter rows by year:
filter(between(reportyear, 2020, 2024)
& !is.na(serovar)
& !is.na(reportyear)
& !is.na(hospitalised_tf)) %>%
# Group by serovar and year:
group_by(serovar, reportyear) %>%
# Calculate total cases and number hospitalised by serovar and year:
summarise(total_cases = n(),
hosp_sum = sum(hospitalised_tf, na.rm = TRUE)) %>%
# Calculate percentage hospitalised:
mutate(hosp_pct = (hosp_sum/total_cases)*100) %>%
# Arrange by year and serovar:
arrange(serovar, reportyear) %>%
# Collect to check results:
collect()
If I run this code on a normal data.frame within R, I get the expected result:
> summarytab
# A tibble: 4 × 5
# Groups: serovar [2]
serovar reportyear total_cases hosp_sum hosp_pct
<chr> <dbl> <int> <int> <dbl>
1 x 2020 3 2 66.7
2 x 2021 1 0 0
3 y 2020 1 1 100
4 y 2021 1 0 0
The problem is that the live database connection gives me 0s or 100s in the percentage column instead of the actual percentage. So I'm assuming something about the conversion to SQL is not right, but can't figure out what. Here is the SQL query:
show_query(summarytab)
<SQL>
SELECT "q01".*, ("hosp_sum" / "total_cases") * 100.0 AS "hosp_pct"
FROM (
SELECT
"serovar",
"reportyear",
COUNT(*) AS "total_cases",
SUM("hospitalised_tf") AS "hosp_sum"
FROM (
SELECT "q01".*
FROM (
SELECT
"q01".*,
CASE
WHEN ("hospitalised" = 'Y') THEN 1
WHEN ("hospitalised" = 'N') THEN 0
ELSE NULL
END AS "hospitalised_tf"
FROM (
SELECT
"Serotype" AS "serovar",
"DateUsedForStatisticsYear" AS "reportyear",
"Hospitalisation" AS "hospitalised"
FROM "FWD"."clean"."SALM_Case"
) "q01"
) "q01"
WHERE ("reportyear" BETWEEN 2020.0 AND 2024.0 AND NOT(("serovar" IS NULL)) AND NOT(("reportyear" IS NULL)) AND NOT(("hospitalised_tf" IS NULL)))
) "q01"
GROUP BY "serovar", "reportyear"
) "q01"
ORDER BY "serovar", "reportyear"
>
There seems to be a lot of repetition in the query, and the percentage calculation is at the top, which seems strange to me - but I can't see any issues with the way the calculation itself has been written.
I would be grateful for an explanation as to why I get a correct result when the percentage calculation is done in the R environment, but an incorrect result when I try to do the calculation via SQL query on the server.
@siggemannen's comment helped me identify the solution - so putting this here in case anyone comes across the same difficulty.
The crux of the issue is that the inputs to the percentage calculation (columns hosp_sum
and total_cases
) are all whole numbers, and are therefore being stored as integers. As explained in this Stack Overflow post numbers stored as numeric can take up more space than integers, because they are stored in two pieces (see here for a demonstration of size differences between integers and numeric in R).
SQL is designed to interact with databases and therefore prioritises space saving / computation time over mathematical convenience. Because I supplied two integers as inputs to the percentage calculation, it defaulted to providing an integer as the result - either rounding up or down (hence the 0s and 100s). As @siggemannen points out, just converting one of the two input numbers from integer to numeric is sufficient to force the result to be stored as numeric as well. I can do this with dbplyr
using the base R as.numeric()
function and then pass this directly to the percentage calculation, keeping it all on the SQL server:
# Calculate total cases and number hospitalised by serovar and year:
summarise(total_cases = as.numeric(n()),
hosp_sum = sum(hospitalised_tf, na.rm = TRUE)) %>%
# Calculate percentage hospitalised:
mutate(hosp_pct = round((hosp_sum/total_cases)*100, digits = 2)) %>%
# Arrange by year and serovar:
arrange(reportyear, serovar) %>%
# Select columns:
select(serovar, reportyear, hosp_pct) %>%
# Collect the resultant table in a data.frame:
collect()
This gives me the correct numeric result (i.e. 2/3 cases hospitalised = 66.67%).
R, on the other hand, prioritises flexibility and numeric accuracy, which is why it will present the fractional result of a calculation as numeric, even when both inputs were stored as integers. As explained in this dbplyr vignette:
R and SQL have different defaults for integers and reals. In R, 1 is a real, and 1L is an integer. In SQL, 1 is an integer, and 1.0 is a real
Also in R, integers are 'a special case of numeric' while in SQL, integers and real numbers (SQL-speak for numeric) are completely separate data types.