I am trying to use read_csv from {readr} to read a CSV file into R. To demonstrate my real issue, I first reset the argument guess_max to 5 (the default is 1000):
library(readr)
formals(read_csv)$guess_max <- 5
and use a small literal dataset as an example:
csv <- I(
"ID, Col1, Col2, VarA, VarB, VarC
1, NA, NA, NA, NA, NA
2, NA, NA, NA, NA, NA
3, NA, NA, NA, NA, NA
4, NA, NA, NA, NA, NA
5, 0, 1, x, y, z
6, NA, NA, NA, NA, NA")
read_csv(csv)
# # A tibble: 6 × 6
# ID Col1 Col2 VarA VarB VarC
# <dbl> <lgl> <lgl> <lgl> <lgl> <lgl>
# 1 1 NA NA NA NA NA
# 2 2 NA NA NA NA NA
# 3 3 NA NA NA NA NA
# 4 4 NA NA NA NA NA
# 5 5 FALSE* TRUE* NA* NA* NA*
# 6 6 NA NA NA NA NA
*: parsing issues occur
Because of guess_max, only the first 5 lines (the column names and IDs 1 to 4) are used for guessing column types. Because the values for IDs 1 to 4 are all missing, every column other than ID is guessed as logical and is parsed incorrectly:
- 0, 1 (integer) → FALSE, TRUE (logical)
- 'x', 'y', 'z' (character) → NA (logical)
In this case I have to set col_types manually:
read_csv(csv, col_types = cols(Col1 = col_integer(), Col2 = col_integer(),
VarA = col_character(), VarB = col_character(), VarC = col_character()))
# # A tibble: 6 × 6
# ID Col1 Col2 VarA VarB VarC
# <dbl> <int> <int> <chr> <chr> <chr>
# 1 1 NA NA NA NA NA
# 2 2 NA NA NA NA NA
# 3 3 NA NA NA NA NA
# 4 4 NA NA NA NA NA
# 5 5 0 1 x y z
# 6 6 NA NA NA NA NA
Supplying the column types one by one is annoying when there are many more columns. If the names of the columns I want to specify follow some pattern, I would like to use <tidy-select>-like syntax to specify a type across multiple columns, such as across() in {dplyr}. The pseudocode looks like:
read_csv(csv, col_types = cols(across(starts_with("Col"), col_integer()),
across(starts_with("Var"), col_character())))
Is this possible with readr itself or with other add-on packages?
Thanks in advance!
I need to use the col_xxx() functions rather than their abbreviations ('i', 'c', etc.) so that the column specification can serve more general purposes, e.g.
cols(across(contains("Date"), col_date(format = "%m-%d-%Y")),
across(Fct1:Fct9, col_factor(levels = custom_levels)))
The read_delim() family already uses tidy-selection for selecting columns through the argument col_select. You can take advantage of this argument to incorporate tidy-selection into the specification of column types. The following is a simple implementation. The key is setting n_max = 0L to read only the line of column names.
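col_across <- function(.cols, .fns, file) {
  # Read only the header line, restricted to the columns matched by .cols
  col_selected <- read_csv(file, n_max = 0L, col_select = {{.cols}}, show_col_types = FALSE)
  # One copy of the collector per selected column, named by column
  lapply(col_selected, function(x) .fns)
}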
df <- read_csv(csv,
col_types = c(col_across(starts_with("Col"), col_integer(), csv),
col_across(VarA:VarC, col_factor(c('x', 'y', 'z')), csv))
)
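To see what the header-only read inside col_across() produces, here is a minimal sketch. The data in demo_csv and the names header_only and col_spec are made up for illustration; the sketch assumes a readr version that accepts I() literal data, n_max = 0L and col_select, as the code above does.

```r
library(readr)

# Hypothetical demo data, laid out like the example in the question.
demo_csv <- I("ID,Col1,Col2,VarA\n1,NA,NA,NA\n2,0,1,x")

# n_max = 0L reads no data rows; col_select keeps only the matching columns.
header_only <- read_csv(demo_csv, n_max = 0L,
                        col_select = starts_with("Col"),
                        show_col_types = FALSE)

names(header_only)  # names of the selected columns
nrow(header_only)   # number of data rows that were read

# Mapping over the (empty) columns yields a named list of collectors,
# which is the shape that col_types is given in the call above.
col_spec <- lapply(header_only, function(x) col_integer())
names(col_spec)
```

Because only the header line is parsed, the extra read should stay cheap even for large files.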
The approach above is simple but only passable, with some drawbacks:
- The file source (csv) needs to be passed to every col_across().
- The read_delim family includes several variants, e.g. read_csv, read_csv2, read_tsv. When calling df <- read_xxx(...), you have to make sure that col_across() uses the same read_xxx to read the column names.
An improved version of col_across is given below. It automatically detects which read_xxx is used and retrieves the file source from the outer call.
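col_across <- function(.cols, .fns) {
  # Outermost call on the stack; this is the read_xxx() call when it is made at top level
  sc <- sys.call(1L)
  # Match arguments to their full names so that `file` can be found by name
  sc <- match.call(match.fun(sc[[1L]]), sc)
  # Keep only the function and its `file` argument
  read_call <- sc[c(1L, match("file", names(sc), 0L))]
  # Read only the header line, restricted to the selected columns
  read_call$n_max <- 0L
  read_call$col_select <- substitute(.cols)
  read_call$show_col_types <- FALSE
  # One copy of the collector per selected column, named by column
  lapply(eval(read_call, parent.frame()), function(x) .fns)
}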
df <- read_csv(csv,
col_types = c(col_across(starts_with("Col"), col_integer()),
col_across(VarA:VarC, col_factor(c('x', 'y', 'z'))))
)
Note that this version of col_across must only be used inside the read_delim family, just as across is to be used inside mutate in {dplyr}.
spec(df)
# cols(
# ID = col_double(),
# Col1 = col_integer(),
# Col2 = col_integer(),
# VarA = col_factor(levels = c("x", "y", "z"), ordered = FALSE, include_na = FALSE),
# VarB = col_factor(levels = c("x", "y", "z"), ordered = FALSE, include_na = FALSE),
# VarC = col_factor(levels = c("x", "y", "z"), ordered = FALSE, include_na = FALSE)
# )
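Since col_across() returns a plain named list of collectors, it can presumably also be spliced into cols() with do.call(), for instance to set a .default type for the columns that were not selected. The following is a minimal sketch under that assumption; collectors_for is a hypothetical stand-in for the simple col_across() with an explicit file argument, and demo_csv is made-up data.

```r
library(readr)

# Hypothetical demo data, laid out like the example in the question.
demo_csv <- I("ID,Col1,Col2,VarA\n1,NA,NA,NA\n2,0,1,x")

# Hypothetical helper: same idea as the simple col_across() above.
collectors_for <- function(.cols, .fns, file) {
  selected <- read_csv(file, n_max = 0L, col_select = {{ .cols }},
                       show_col_types = FALSE)
  lapply(selected, function(x) .fns)
}

# Splice the named collectors into cols() and add a default for the rest.
spec_full <- do.call(cols, c(
  collectors_for(starts_with("Col"), col_integer(), demo_csv),
  list(.default = col_character())
))

demo_df <- read_csv(demo_csv, col_types = spec_full)
spec(demo_df)
```

This keeps the tidy-selected columns as integers while every other column falls back to character, which avoids relying on guess_max for the unselected columns.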