How to specify column types with abbreviations when skipping columns with read_csv


I would like to read in selected columns from a CSV file, using abbreviations supported by the cols function in the readr package. However, when I skip columns, readr tries to guess the column type, rather than using my specification, unless I specify the columns by name or set a default.

Here's a reproducible example:

library(tidyverse)

out <- tibble(a = c(1234, 5678),
       b = c(9876, 5432),
       c = c(4321, 8901))

write_csv(out, "test.csv")

test <- read_csv("test.csv",
                 col_select = c(a, c),
                 col_types = "cc")

typeof(test$c)
#> [1] "double"

I can get the correct specification by explicitly indicating the column name:

test2 <- read_csv("test.csv",
                 col_select = c(a, c),
                 col_types = c(a = "c", c = "c"))
typeof(test2$c)
#> [1] "character"

I can also get the correct specification by setting character as the default, as suggested in this Q&A. But I'm wondering if there is a way to get the correct specification using the abbreviation "cc" or -- alternatively -- how to generate an abbreviation string based on the columns that were skipped. My real use case involves a large number of skipped columns, so I don't want to use - or _ to specify the skipped columns.


Solution

  • I've rewritten my earlier answer to be clearer, based on my understanding of what you are asking.

    If you want to see the column types that read_csv() would guess for your CSV file, before any skipping or manual changes, the easiest approach is to call the spec_csv() function on the file. It prints a column specification showing how read_csv() will classify each column.

    From there you can copy, paste, and edit that specification into your col_types argument so that only the columns and column types you want are read. Use the cols_only() function instead of cols() for this, since cols_only() skips every column you do not list.

    spec_csv("test.csv")
    

    This prints the following to the console:

    cols(
      a = col_double(),
      b = col_double(),
      c = col_double()
    )
    

    The output shows the column types that readr would guess by default. (You can pass spec_csv() the same arguments as read_csv(), for example guess_max to increase the number of rows used for guessing the column types.)

    # Manually copied and pasted the output above, changed the guessed types to the desired type, and deleted the columns I didn't want
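
    As a minimal sketch of the guess_max point above: the snippet below recreates the example data in a temporary file (the path variable and the value 10000 are assumptions chosen for illustration, not part of the original question) and inspects the guessed specification programmatically.

    ```r
    library(readr)

    # Recreate the small example file from the question in a temporary location
    # (`path` is a hypothetical stand-in for "test.csv")
    path <- tempfile(fileext = ".csv")
    write_csv(data.frame(a = c(1234, 5678),
                         b = c(9876, 5432),
                         c = c(4321, 8901)),
              path)

    # Guess column types from up to 10,000 rows; the number is arbitrary
    guessed <- spec_csv(path, guess_max = 10000)

    # Print the guessed specification, then pull out just the column names
    print(guessed)
    col_names_in_file <- names(guessed$cols)
    ```

    Having the column names as a plain character vector is handy when the file has too many columns to edit the printed specification by hand.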
    
    read_csv("test.csv",
             col_types=cols_only(a = col_character(),
                                 c = col_character())
      )
    

    I used the long form (col_character()), but you can use the abbreviation instead, as you already indicated.

    Please let me know if this is what you were asking, or if there is anything I can clarify.
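
    A sketch of the abbreviated form, plus two ways to avoid typing the specification by hand when many columns are skipped. It assumes the example data from the question, written to a temporary file. The names path, keep, spec, and abbr are hypothetical and introduced only for illustration.

    ```r
    library(readr)

    # Recreate the example file from the question in a temporary location
    path <- tempfile(fileext = ".csv")
    write_csv(data.frame(a = c(1234, 5678),
                         b = c(9876, 5432),
                         c = c(4321, 8901)),
              path)

    # 1. Abbreviations inside cols_only(); unlisted columns are skipped
    test3 <- read_csv(path, col_types = cols_only(a = "c", c = "c"))

    # 2. Build the same specification from a vector of column names
    keep <- c("a", "c")
    spec <- do.call(cols_only, as.list(setNames(rep("c", length(keep)), keep)))
    test4 <- read_csv(path, col_types = spec)

    # 3. Generate a compact string with "-" for every skipped column,
    #    using the header names reported by spec_csv()
    header <- names(spec_csv(path)$cols)
    abbr <- paste(ifelse(header %in% keep, "c", "-"), collapse = "")
    abbr
    #> [1] "c-c"
    test5 <- read_csv(path, col_types = abbr)
    ```

    The third option still uses "-" for skipped columns, but the string is generated from the header rather than written out manually, so it scales to files with many columns.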