Search code examples
rsql-serverdplyrodbcdbplyr

tbl and mutate dplyr: incorrect formatting of variables


I am currently using dplyr::tbl to interact with SQL databases in Azure. I like the ability to delay evaluation of a query until it is required and use this within a number of functions.

One thing that I need to do sometimes is modify how tbl() reads data into the R environment. However, I have just noticed some unexpected behaviour when using mutate to do this. Please see below.

# Load library    
library(dplyr)
library(DBI)
library(odbc)
library(purrr)

Connect to db using odbc

odbc::dbConnect(odbc::odbc(),
                Driver = "ODBC Driver 17 for SQL Server",
...)

    > class(db_con)
[1] "Microsoft SQL Server"
attr(,"package")
[1] ".GlobalEnv"

> # Azure Database
> # Microsoft SQL Server Version: 12.00.2195

Specify test data using mtcars, replace the first row with NAs.

    > mtcars[1,] <- NA
    > mtcars_mod <- as_tibble(mtcars)
    > head(mtcars_mod)
    # A tibble: 6 x 11
        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    1  NA      NA    NA    NA NA    NA     NA      NA    NA    NA    NA
    2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
    3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
    4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
    5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
    6  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1

> # See what will be mapped
> purrr::map(mtcars_mod, ~dbDataType(db_con, .))
$mpg
[1] "FLOAT"

$cyl
[1] "FLOAT"

$disp
[1] "FLOAT"

$hp
[1] "FLOAT"

$drat
[1] "FLOAT"

$wt
[1] "FLOAT"

$qsec
[1] "FLOAT"

$vs
[1] "FLOAT"

$am
[1] "FLOAT"

$gear
[1] "FLOAT"

$carb
[1] "FLOAT"

> 
> # write to database
> dbWriteTable(db_con, "mtcars_mod", mtcars_mod, overwrite = TRUE)

Now the data has been written to the database I want to bring it back into the environment. All variables are currently "float" in the schema.

> # Returns the correct result using 
> dbReadTable(db_con, "mtcars_mod") %>% head()
   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1   NA  NA   NA  NA   NA    NA    NA NA NA   NA   NA
2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
> 
> # using dplyr::tbl - returns the correct result
> dplyr::tbl(db_con, "mtcars_mod") %>% collect()
# A tibble: 32 x 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  NA      NA   NA     NA NA    NA     NA      NA    NA    NA    NA
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# ... with 22 more rows

This is all fine to this point, but what if I want to change how the data is read into the environement before calling to collect(). Lets see...

> # What if changing the format of a column?
> # This completely changes the expression of the variables
> # mpg is correctly aligned as chr, but for remaining variables 0s or other values
> # are returned for missing values and this changes on each call.
> 
> (call_1 <- tbl(db_con, "mtcars_mod") %>% 
+   mutate(mpg = as.character(mpg)) %>% 
+   collect())
# A tibble: 32 x 11
   mpg         cyl      disp    hp   drat        wt  qsec        vs        am  gear      carb
   <chr>     <dbl>     <dbl> <dbl>  <dbl>     <dbl> <dbl>     <dbl>     <dbl> <dbl>     <dbl>
 1 NA    2.37e-307 2.73e-310   NaN NaN    -5.55e303 NaN   2.37e-307 8.49e-314     0 2.12e-314
 2 21    6   e+  0 1.6 e+  2   110   3.9   2.88e  0  17.0 0         1   e+  0     4 4   e+  0
 3 22.8  4   e+  0 1.08e+  2    93   3.85  2.32e  0  18.6 1   e+  0 1   e+  0     4 1   e+  0
 4 21.4  6   e+  0 2.58e+  2   110   3.08  3.22e  0  19.4 1   e+  0 0             3 1   e+  0
 5 18.7  8   e+  0 3.6 e+  2   175   3.15  3.44e  0  17.0 0         0             3 2   e+  0
 6 18.1  6   e+  0 2.25e+  2   105   2.76  3.46e  0  20.2 1   e+  0 0             3 1   e+  0
 7 14.3  8   e+  0 3.6 e+  2   245   3.21  3.57e  0  15.8 0         0             3 4   e+  0
 8 24.4  4   e+  0 1.47e+  2    62   3.69  3.19e  0  20   1   e+  0 0             4 2   e+  0
 9 22.8  4   e+  0 1.41e+  2    95   3.92  3.15e  0  22.9 1   e+  0 0             4 2   e+  0
10 19.2  6   e+  0 1.68e+  2   123   3.92  3.44e  0  18.3 1   e+  0 0             4 4   e+  0
# ... with 22 more rows
> 
> (call_2 <- tbl(db_con, "mtcars_mod") %>% 
+   mutate(mpg = as.character(mpg)) %>% 
+   collect())
# A tibble: 32 x 11
   mpg         cyl      disp        hp  drat        wt  qsec        vs        am      gear      carb
   <chr>     <dbl>     <dbl>     <dbl> <dbl>     <dbl> <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
 1 NA    2.12e-314 2.37e-307 2.12e-314  0    2.12e-314 NaN   2.12e-314 2.37e-307 2.12e-314 8.49e-314
 2 21    6   e+  0 1.6 e+  2 1.1 e+  2  3.9  2.88e+  0  17.0 0         1   e+  0 4   e+  0 4   e+  0
 3 22.8  4   e+  0 1.08e+  2 9.3 e+  1  3.85 2.32e+  0  18.6 1   e+  0 1   e+  0 4   e+  0 1   e+  0
 4 21.4  6   e+  0 2.58e+  2 1.1 e+  2  3.08 3.22e+  0  19.4 1   e+  0 0         3   e+  0 1   e+  0
 5 18.7  8   e+  0 3.6 e+  2 1.75e+  2  3.15 3.44e+  0  17.0 0         0         3   e+  0 2   e+  0
 6 18.1  6   e+  0 2.25e+  2 1.05e+  2  2.76 3.46e+  0  20.2 1   e+  0 0         3   e+  0 1   e+  0
 7 14.3  8   e+  0 3.6 e+  2 2.45e+  2  3.21 3.57e+  0  15.8 0         0         3   e+  0 4   e+  0
 8 24.4  4   e+  0 1.47e+  2 6.2 e+  1  3.69 3.19e+  0  20   1   e+  0 0         4   e+  0 2   e+  0
 9 22.8  4   e+  0 1.41e+  2 9.5 e+  1  3.92 3.15e+  0  22.9 1   e+  0 0         4   e+  0 2   e+  0
10 19.2  6   e+  0 1.68e+  2 1.23e+  2  3.92 3.44e+  0  18.3 1   e+  0 0         4   e+  0 4   e+  0
# ... with 22 more rows
> 
> (call_3 <- tbl(db_con, "mtcars_mod") %>% 
+     mutate(mpg = as.character(mpg)) %>% 
+     collect())
# A tibble: 32 x 11
   mpg     cyl  disp    hp   drat     wt  qsec    vs    am  gear  carb
   <chr> <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 NA        0  NaN    NaN NaN    NaN    NaN     NaN   NaN   NaN   NaN
 2 21        6  160    110   3.9    2.88  17.0     0     1     4     4
 3 22.8      4  108     93   3.85   2.32  18.6     1     1     4     1
 4 21.4      6  258    110   3.08   3.22  19.4     1     0     3     1
 5 18.7      8  360    175   3.15   3.44  17.0     0     0     3     2
 6 18.1      6  225    105   2.76   3.46  20.2     1     0     3     1
 7 14.3      8  360    245   3.21   3.57  15.8     0     0     3     4
 8 24.4      4  147.    62   3.69   3.19  20       1     0     4     2
 9 22.8      4  141.    95   3.92   3.15  22.9     1     0     4     2
10 19.2      6  168.   123   3.92   3.44  18.3     1     0     4     4
# ... with 22 more rows

It can be seen that with different calls that the variable expression in the R environment changes on each call, even though I am only modifying one columns variable type (numeric to character).

Further, when I take the SQL query generated by show_query() and run directly in MSSQL the issue is not present. It appears to be a direct issue with R and/or dplyr::tbl.

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252    LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.1252    

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] tibble_3.0.6  keyring_1.1.0 purrr_0.3.4   odbc_1.3.0    DBI_1.1.1     dplyr_1.0.7  

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7          rstudioapi_0.13     magrittr_2.0.1      hms_1.0.0           tidyselect_1.1.0    bit_4.0.4           R6_2.5.0           
 [8] rlang_0.4.10        fansi_0.4.2         blob_1.2.1          tools_4.0.3         utf8_1.1.4          cli_2.3.0           withr_2.4.1        
[15] dbplyr_2.1.1        remotes_2.4.0       ellipsis_0.3.2      bit64_4.0.5         assertthat_0.2.1    lifecycle_1.0.0     neon_0.1.20        
[22] crayon_1.4.0        BiocManager_1.30.16 vctrs_0.3.8         glue_1.4.2          compiler_4.0.3      pillar_1.6.2        generics_0.1.0     
[29] lubridate_1.7.10    renv_0.12.5         pkgconfig_2.0.3 

Solution

  • This was an issue with ODBC package not dplyr.

    updating from 1.3.0 to 1.3.2 fixes this. Please see here: https://cran.r-project.org/web/packages/odbc/news/news.html