I just noticed that read_csv() somehow uses random numbers, which is unexpected (at least to me). The corresponding base R function read.csv() does not do that. So what does read_csv() use the random numbers for? I looked into the documentation but could not find a clear answer. Are the random numbers related to the guess_max argument?
library(tidyverse)
set.seed(123)
rnorm(1)
# [1] -0.5604756
set.seed(123)
dat <- read.csv("data/titanic.csv")
rnorm(1)
# [1] -0.5604756
set.seed(123)
dat <- read_csv("data/titanic.csv")
rnorm(1)
#[1] 1.239496
EDIT: I specified col_types explicitly, and indeed it worked. But I still wonder why this is happening. Does anyone have an explanation?
set.seed(123)
dat <- read_csv("data/titanic.csv", col_types = c("dddccdddcdcc"))
rnorm(1)
#[1] -0.5604756
In case this is related to the readr version, here is my session info.
library(readr)
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 4.0.5 (2021-03-31)
#> os Windows 10 x64
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate German_Germany.1252
#> ctype German_Germany.1252
#> tz Europe/Berlin
#> date 2021-06-10
#>
#> - Packages -------------------------------------------------------------------
#> package * version date lib source
#> cli 2.5.0 2021-04-26 [1] CRAN (R 4.0.3)
#> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.4)
#> digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.3)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.0.3)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.3)
#> fansi 0.5.0 2021-05-25 [1] CRAN (R 4.0.5)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.0.5)
#> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.3)
#> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.3)
#> highr 0.9 2021-04-16 [1] CRAN (R 4.0.5)
#> hms 1.0.0 2021-01-13 [1] CRAN (R 4.0.5)
#> htmltools 0.5.1.9003 2021-05-07 [1] Github (rstudio/htmltools@e12171e)
#> knitr 1.33 2021-04-24 [1] CRAN (R 4.0.5)
#> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.4)
#> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.3)
#> pillar 1.6.1 2021-05-16 [1] CRAN (R 4.0.5)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.3)
#> ps 1.6.0 2021-02-28 [1] CRAN (R 4.0.5)
#> R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.3)
#> readr * 1.4.0 2020-10-05 [1] CRAN (R 4.0.5)
#> reprex 2.0.0 2021-04-02 [1] CRAN (R 4.0.5)
#> rlang 0.4.11.9000 2021-05-29 [1] Github (r-lib/rlang@7797cdf)
#> rmarkdown 2.8.1 2021-05-07 [1] Github (rstudio/rmarkdown@e98207f)
#> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.3)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.3)
#> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.3)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.3)
#> tibble 3.1.2 2021-05-16 [1] CRAN (R 4.0.5)
#> utf8 1.2.1 2021-03-12 [1] CRAN (R 4.0.3)
#> vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.0.3)
#> withr 2.4.2 2021-04-18 [1] CRAN (R 4.0.5)
#> xfun 0.22 2021-03-11 [1] CRAN (R 4.0.5)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.3)
#>
#> [1] C:/Users/Albert/Documents/R/win-library/4.0
#> [2] C:/Program Files/R/R-4.0.5/library
Created on 2021-06-10 by the reprex package (v2.0.0)
tl;dr: somewhere deep in the guts of the cli package (called to generate the pretty-printed output about column types), the code is generating a random string to use as a label.
A major clue is that
set.seed(123); dat <- read_csv("iris.csv", col_types=cols()); rnorm(1)
runs read_csv, guessing the column types but without printing information about the guesses; this doesn't hit the RNG, which makes me think it's something in the fancy colour printing.
By making a copy of the random seed info (R <- .Random.seed), stepping through the code (debug(readr::show_cols_spec)), and periodically running identical(R, .Random.seed) to check on the status, I found that the random info changes after running
cli::cli_h1("Column specification")
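The snapshot-and-compare trick is easy to wrap up for reuse. Here is a minimal sketch in base R; the helper name changes_rng is made up for illustration and is not part of readr or cli.

```r
# Hypothetical helper: report whether evaluating `expr` changed the RNG state.
changes_rng <- function(expr) {
  # .Random.seed only exists once the RNG has been used or seeded,
  # so initialise it if necessary.
  if (!exists(".Random.seed", envir = globalenv(), inherits = FALSE)) {
    set.seed(NULL)
  }
  before <- get(".Random.seed", envir = globalenv())
  force(expr)  # arguments are lazy, so the expression is evaluated here
  after <- get(".Random.seed", envir = globalenv())
  !identical(before, after)
}

set.seed(123)
res_deterministic <- changes_rng(mean(1:10))  # no random draws involved
res_random <- changes_rng(sample(10))         # sample() consumes random numbers
```

With this in place, something like changes_rng(read_csv("data/titanic.csv")) should tell you whether a given call touches the RNG, without stepping through the debugger.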
Debugging into that function, the change occurs somewhere in cli::cli__message; specifically, right before we execute this line
if ("id" %in% names(args) && is.null(args$id)) args$id <- new_uuid()
(which is in the source code of cli), identical(R, .Random.seed) is still TRUE; after running it, it's FALSE. More specifically, all we have to do at this point is evaluate the args argument (e.g. by typing args in the debugger).
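That merely evaluating args is enough makes sense once you remember that R evaluates function arguments lazily: the expression passed as an argument only runs when the argument is first used. A small sketch of the effect, using a made-up function f and runif() as a stand-in for whatever cli actually does:

```r
# Hypothetical function: the RNG state only changes once `args` is forced.
f <- function(args) {
  before <- get(".Random.seed", envir = globalenv())
  # The promise behind `args` has not been evaluated yet.
  untouched <- identical(before, get(".Random.seed", envir = globalenv()))
  args  # forcing the promise runs runif() in the caller's expression
  changed <- !identical(before, get(".Random.seed", envir = globalenv()))
  c(untouched_before_forcing = untouched, changed_after_forcing = changed)
}

set.seed(123)
res <- f(list(id = runif(1)))
```

Both elements of res should be TRUE: nothing happens when f is called, and the random draw happens only at the moment args is touched.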
Working our way back up the chain and trying things by hand, we can see that manually evaluating
glue_cmd(text, .envir = .envir)
at this point in the code changes the random info.
Still more stepping through takes us to a point within glue_cmd where we call make_cmd_transformer, which in turn calls a function called random_id():
values$marker <- random_id()
random_id() then calls sample ...
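To see why a sample()-based ID generator produces exactly the symptom in the question, here is a rough sketch. Note that random_id_sketch is my own stand-in, not cli's actual implementation, and with_seed_preserved is a made-up base R helper showing one way to shield your random stream from such side effects.

```r
# Hypothetical stand-in for an ID generator built on sample(); NOT cli's code.
random_id_sketch <- function(n = 8) {
  paste(sample(c(letters, 0:9), n, replace = TRUE), collapse = "")
}

# Hypothetical helper: evaluate `expr`, then restore the previous RNG state.
# Assumes .Random.seed already exists (e.g. after a call to set.seed()).
with_seed_preserved <- function(expr) {
  old <- get(".Random.seed", envir = globalenv())
  on.exit(assign(".Random.seed", old, envir = globalenv()))
  expr
}

set.seed(123)
x1 <- rnorm(1)                  # reference draw

set.seed(123)
id <- random_id_sketch()        # advances the RNG stream
x2 <- rnorm(1)                  # so this draw differs from x1

set.seed(123)
id2 <- with_seed_preserved(random_id_sketch())
x3 <- rnorm(1)                  # stream was restored, so this matches x1
```

If you have withr installed, withr::with_preserve_seed() does the same job as the helper above. Specifying col_types, as in the question's edit, avoids the issue by never triggering the printed column specification in the first place.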
I have no idea why this internal bit of cli needs to be generating a random string, but I guess you could ask the maintainers?
This was done using readr 1.4.0 and cli 2.5.0 (although the code references are to the current version on GitHub [10 June 2021]).