
What does read_csv() use random numbers for?


I just noticed that read_csv() somehow uses random numbers, which I did not expect. The corresponding base R function read.csv() does not do that. So what does read_csv() use random numbers for? I looked through the documentation but could not find a clear answer. Are the random numbers related to the guess_max argument?

library(tidyverse)
set.seed(123)
rnorm(1)
# [1] -0.5604756

set.seed(123)
dat <- read.csv("data/titanic.csv")
rnorm(1)
# [1] -0.5604756

set.seed(123)
dat <- read_csv("data/titanic.csv")
rnorm(1)
#[1] 1.239496

EDIT:

  1. As suggested in rawr's comment, I tried specifying col_types, and the random number stream is indeed unaffected then. But I still wonder why this happens. Does anyone have an explanation?
set.seed(123)
dat <- read_csv("data/titanic.csv", col_types = c("dddccdddcdcc"))
rnorm(1)
#[1] -0.5604756
  2. Since a lot of people asked about the R and readr versions, here is my session info.
library(readr)
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 4.0.5 (2021-03-31)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  German_Germany.1252         
#>  ctype    German_Germany.1252         
#>  tz       Europe/Berlin               
#>  date     2021-06-10                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version     date       lib source                            
#>  cli           2.5.0       2021-04-26 [1] CRAN (R 4.0.3)                    
#>  crayon        1.4.1       2021-02-08 [1] CRAN (R 4.0.4)                    
#>  digest        0.6.27      2020-10-24 [1] CRAN (R 4.0.3)                    
#>  ellipsis      0.3.2       2021-04-29 [1] CRAN (R 4.0.3)                    
#>  evaluate      0.14        2019-05-28 [1] CRAN (R 4.0.3)                    
#>  fansi         0.5.0       2021-05-25 [1] CRAN (R 4.0.5)                    
#>  fastmap       1.1.0       2021-01-25 [1] CRAN (R 4.0.5)                    
#>  fs            1.5.0       2020-07-31 [1] CRAN (R 4.0.3)                    
#>  glue          1.4.2       2020-08-27 [1] CRAN (R 4.0.3)                    
#>  highr         0.9         2021-04-16 [1] CRAN (R 4.0.5)                    
#>  hms           1.0.0       2021-01-13 [1] CRAN (R 4.0.5)                    
#>  htmltools     0.5.1.9003  2021-05-07 [1] Github (rstudio/htmltools@e12171e)
#>  knitr         1.33        2021-04-24 [1] CRAN (R 4.0.5)                    
#>  lifecycle     1.0.0       2021-02-15 [1] CRAN (R 4.0.4)                    
#>  magrittr      2.0.1       2020-11-17 [1] CRAN (R 4.0.3)                    
#>  pillar        1.6.1       2021-05-16 [1] CRAN (R 4.0.5)                    
#>  pkgconfig     2.0.3       2019-09-22 [1] CRAN (R 4.0.3)                    
#>  ps            1.6.0       2021-02-28 [1] CRAN (R 4.0.5)                    
#>  R6            2.5.0       2020-10-28 [1] CRAN (R 4.0.3)                    
#>  readr       * 1.4.0       2020-10-05 [1] CRAN (R 4.0.5)                    
#>  reprex        2.0.0       2021-04-02 [1] CRAN (R 4.0.5)                    
#>  rlang         0.4.11.9000 2021-05-29 [1] Github (r-lib/rlang@7797cdf)      
#>  rmarkdown     2.8.1       2021-05-07 [1] Github (rstudio/rmarkdown@e98207f)
#>  rstudioapi    0.13        2020-11-12 [1] CRAN (R 4.0.3)                    
#>  sessioninfo   1.1.1       2018-11-05 [1] CRAN (R 4.0.3)                    
#>  stringi       1.5.3       2020-09-09 [1] CRAN (R 4.0.3)                    
#>  stringr       1.4.0       2019-02-10 [1] CRAN (R 4.0.3)                    
#>  tibble        3.1.2       2021-05-16 [1] CRAN (R 4.0.5)                    
#>  utf8          1.2.1       2021-03-12 [1] CRAN (R 4.0.3)                    
#>  vctrs         0.3.8       2021-04-29 [1] CRAN (R 4.0.3)                    
#>  withr         2.4.2       2021-04-18 [1] CRAN (R 4.0.5)                    
#>  xfun          0.22        2021-03-11 [1] CRAN (R 4.0.5)                    
#>  yaml          2.2.1       2020-02-01 [1] CRAN (R 4.0.3)                    
#> 
#> [1] C:/Users/Albert/Documents/R/win-library/4.0
#> [2] C:/Program Files/R/R-4.0.5/library

Created on 2021-06-10 by the reprex package (v2.0.0)


Solution

  • tl;dr: somewhere deep in the guts of the cli package (which is called to generate the pretty-printed output about column types), the code generates a random string to use as a label.


    A major clue is that

    set.seed(123); dat <- read_csv("iris.csv", col_types=cols()); rnorm(1)
    

    runs read_csv guessing the column types but without printing information about the guesses; this doesn't hit the RNG, which makes me think it's something in the fancy colour printing.

    By making a copy of the random seed (R <- .Random.seed), stepping through the code (debug(readr::show_cols_spec)), and periodically running identical(R, .Random.seed) to check whether the state had changed, I found that the RNG state changes after running

    cli::cli_h1("Column specification")
    

    Debugging into that function, the change occurs somewhere in cli::cli__message; specifically, right before we execute this line

     if ("id" %in% names(args) && is.null(args$id)) args$id <- new_uuid()
    

    (this line is in the cli source code), identical(R, .Random.seed) is still TRUE; after running it, it is FALSE. More specifically, all we have to do at this point is evaluate the args argument (e.g. by typing args in the debugger). R evaluates function arguments lazily, so forcing args is presumably what runs the RNG-consuming code.
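The snapshot-and-compare check described above, and the observation that merely evaluating an argument is enough to change the state, can be reproduced in base R without readr or cli. This is only a minimal sketch: rng_changed() and takes_lazy_arg() are hypothetical names, and sample(10) stands in for whatever RNG-consuming code sits behind the real argument.

```r
# Hypothetical helper: TRUE if the global RNG state differs from a snapshot.
rng_changed <- function(snapshot) {
  !identical(snapshot, get(".Random.seed", envir = globalenv()))
}

# Hypothetical stand-in for a function that receives a lazily evaluated argument.
takes_lazy_arg <- function(args, snapshot) {
  before <- rng_changed(snapshot)  # promise not forced yet, so no RNG use so far
  force(args)                      # forcing the promise runs sample(10)
  after <- rng_changed(snapshot)
  c(before = before, after = after)
}

set.seed(123)
snapshot <- get(".Random.seed", envir = globalenv())
result <- takes_lazy_arg(sample(10), snapshot)
print(result)
```

Here before is FALSE and after is TRUE: the RNG state only moves once the argument is actually evaluated, which matches what the debugger showed.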

    Working our way back up the chain and trying things by hand, we can see that manually evaluating

    glue_cmd(text, .envir = .envir)
    

    at this point in the code changes the RNG state.

    Stepping through further takes us into glue_cmd, which calls make_cmd_transformer, which in turn calls a function named random_id():

    values$marker <- random_id()
    

    random_id() then calls sample(), which is what advances the RNG state.
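To illustrate why an ID generator built on sample() shifts later random draws, here is a small base R sketch. random_id_sketch() is a hypothetical function written for this answer; it is not cli's actual implementation, and the length and alphabet are assumptions.

```r
# Hypothetical ID generator in the spirit of the description above;
# this is NOT cli's actual code.
random_id_sketch <- function(n = 20) {
  paste(sample(c(letters, 0:9), n, replace = TRUE), collapse = "")
}

set.seed(123)
expected <- rnorm(1)      # the draw you get when nothing else touches the RNG

set.seed(123)
id <- random_id_sketch()  # consumes random numbers as a side effect
shifted <- rnorm(1)       # no longer the first draw after set.seed(123)

nchar(id)
identical(expected, shifted)
```

The ID has 20 characters, and the final comparison is FALSE: generating the label used up part of the random stream, so the following rnorm(1) differs from the one in the question.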

    I have no idea why this internal bit of cli needs to be generating a random string, but I guess you could ask the maintainers?


    This was done using readr 1.4.0 and cli 2.5.0 (although the code references are to the current version on GitHub [10 June 2021]).
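If you cannot (or do not want to) specify col_types, one possible workaround is to save the RNG state before the call and restore it afterwards. The sketch below uses base R only; with_rng_restored() is a hypothetical helper, and sample(10) stands in for a call such as read_csv(). I believe withr::with_preserve_seed() serves the same purpose if you prefer an existing package function.

```r
# Hypothetical helper: evaluate `expr`, then put the global RNG state back.
with_rng_restored <- function(expr) {
  old_seed <- get0(".Random.seed", envir = globalenv(), inherits = FALSE)
  on.exit({
    if (is.null(old_seed)) {
      # No seed existed before: remove any seed that `expr` created.
      if (exists(".Random.seed", envir = globalenv(), inherits = FALSE)) {
        rm(".Random.seed", envir = globalenv())
      }
    } else {
      assign(".Random.seed", old_seed, envir = globalenv())
    }
  }, add = TRUE)
  expr  # forces the lazily evaluated expression and returns its value
}

set.seed(123)
expected <- rnorm(1)

set.seed(123)
ignored <- with_rng_restored(sample(10))  # stand-in for a call like read_csv()
restored <- rnorm(1)

identical(expected, restored)
```

The comparison is TRUE here, because the state saved before the wrapped call is written back once the call finishes.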