Search code examples
rfile-ioreadracrosstidyselect

Specify a column type across multiple columns with tidy-selection in readr package


I attempt to use read_csv from {readr} to read a CSV file into R. To demonstrate my real issue, I reset the argument guess_max to 5 at first (default is 1000)

library(readr)
formals(read_csv)$guess_max <- 5

and take a smaller literal data for example:

csv <- I(
"ID, Col1, Col2, VarA, VarB, VarC
1, NA, NA, NA, NA, NA
2, NA, NA, NA, NA, NA
3, NA, NA, NA, NA, NA
4, NA, NA, NA, NA, NA
5, 0, 1, x, y, z
6, NA, NA, NA, NA, NA")

read_csv(csv)

# # A tibble: 6 × 6
#      ID  Col1    Col2    VarA   VarB   VarC 
#   <dbl>  <lgl>   <lgl>   <lgl>  <lgl>  <lgl>
# 1     1  NA      NA      NA     NA     NA   
# 2     2  NA      NA      NA     NA     NA   
# 3     3  NA      NA      NA     NA     NA   
# 4     4  NA      NA      NA     NA     NA   
# 5     5  FALSE*  TRUE*   NA*    NA*    NA*
# 6     6  NA      NA      NA     NA     NA

*: parsing issues occur


Affected by guess_max, only the first 5 lines (column names and ID 1 to 4) are used for guessing column types. Because the values in ID 1 to 4 are all missing, all columns are guessed as logical and are parsed incorrectly:

  • 0, 1 (integer) → FALSE, TRUE (logical)
  • 'x', 'y', 'z' (character) → NA (logical)

In this case I have to set col_types manually:

read_csv(csv, col_types = cols(Col1 = col_integer(), Col2 = col_integer(),
                               VarA = col_character(), VarB = col_character(), VarC = col_character()))

# # A tibble: 6 × 6                                                                                                   
#      ID  Col1  Col2 VarA  VarB  VarC 
#   <dbl> <int> <int> <chr> <chr> <chr>
# 1     1    NA    NA NA    NA    NA   
# 2     2    NA    NA NA    NA    NA   
# 3     3    NA    NA NA    NA    NA   
# 4     4    NA    NA NA    NA    NA   
# 5     5     0     1 x     y     z    
# 6     6    NA    NA NA    NA    NA

Supplying the column types one by one is annoying when there are much more columns. If the names of those columns I want to specify have some patterns, I expect to use the <tidy-select>-like syntax to specify a type across multiple columns, such as across() in {dplyr}. The pseudocode is like:

read_csv(csv, col_types = cols(across(starts_with("Col"), col_integer()),
                               across(starts_with("Var"), col_character())))

Is it possible by readr itself or other add-on packages?

Thanks in advance!


Edits

I need to use col_xxx() rather than their abbreviations ('i', 'c', etc.) to create column specification for more general purpose, e.g.

cols(across(contains("Date"), col_date(format = "%m-%d-%Y")),
     across(Fct1:Fct9, col_factor(levels = custom_levels)))

Solution

  • The read_delim() family has used tidy-selection for selecting columns with the argument col_select. You can take advantage of this argument to incorporate tidy-selection into the specification of column types. The following is a simple implementation. The key is setting n_max = 0L to only read the line of column names.

    Version 1

    col_across <- function(.cols, .fns, file) {
      col_selected <- read_csv(file, n_max = 0L, col_select = {{.cols}}, show_col_types = FALSE)
      lapply(col_selected, function(x) .fns)
    }
    
    Usage
    df <- read_csv(csv,
      col_types = c(col_across(starts_with("Col"), col_integer(), csv),
                    col_across(VarA:VarC, col_factor(c('x', 'y', 'z')), csv))
    )
    

    The way above is simple but just passable with some drawbacks:

    1. The same file source (i.e. object csv) needs to be passed to every col_across().
    2. The read_delim family includes several variants, e.g. read_csv, read_csv2, read_tsv. When calling df <- read_xxx(...), you have to confirm that col_across() has used the consistent read_xxx to read column names.


    Version 2

    An improved version of col_across is developed that automatically detects which read_xxx is used, and retrieves the file source from the outer call.

    col_across <- function(.cols, .fns) {
      sc <- sys.call(1L)
      sc <- match.call(match.fun(sc[[1L]]), sc)
      read_call <- sc[c(1L, match("file", names(sc), 0L))]
      read_call$n_max <- 0L
      read_call$col_select <- substitute(.cols)
      read_call$show_col_types <- FALSE
      lapply(eval(read_call, parent.frame()), function(x) .fns)
    }
    
    Usage
    df <- read_csv(csv,
      col_types = c(col_across(starts_with("Col"), col_integer()),
                    col_across(VarA:VarC, col_factor(c('x', 'y', 'z'))))
    )
    

    Note that this version of col_across must only be used inside the read_delim family, as across is to mutate in {dplyr}.

    Examine the column specifications
    spec(df)
    
    # cols(
    #   ID = col_double(),
    #   Col1 = col_integer(),
    #   Col2 = col_integer(),
    #   VarA = col_factor(levels = c("x", "y", "z"), ordered = FALSE, include_na = FALSE),
    #   VarB = col_factor(levels = c("x", "y", "z"), ordered = FALSE, include_na = FALSE),
    #   VarC = col_factor(levels = c("x", "y", "z"), ordered = FALSE, include_na = FALSE)
    # )