Search code examples
rtidyversereadr

read_csv use col_double() instead for all numeric columns


I would like to use readr::read_csv instead of read.csv, due to its speed and automatically converting dates. However there is one problem with how it handles numbers that are mostly integers, with a few floats sprinkled in.

Is there a way to force it to use col_double for all numbers, while still using col_guess for all other columns?

To me it seems like guessing col_integer is a suboptimal choice by the package developer. This seems to happen a lot with real data for me. For example, when non-zeroes are rare.

I am opening files without knowing the column types or names ahead of time.

Here is an illustration of the issue:

df<-data.frame(
    i=as.integer(c(1:5)),
    d=seq.Date(as.Date('2019-01-01'), length.out = 5, by=1),
    mix = c('1','2','3.1','4','5'),
    stringsAsFactors = F
)%>%as.tbl
write_csv(df, '~/temp.csv')


this is good! The 3.1 value is read in correctly.

read_csv('~/temp.csv')
# A tibble: 5 x 3
      i d            mix
  <int> <date>     <dbl>
1     1 2019-01-01   1  
2     2 2019-01-02   2  
3     3 2019-01-03   3.1
4     4 2019-01-04   4  
5     5 2019-01-05   5 


50k row data frame, with a decimal place value in later rows.

df_large <-data.frame(
    i = as.integer(c(1:(1e4))),
    d=seq.Date(as.Date('2019-01-01'), length.out = 1e4, by=1),
    mix = as.character(c(1:(1e4))),
    stringsAsFactors = F
)%>%as.tbl
bind_rows(df_large, df)%>%tail(7)
# A tibble: 7 x 3
      i d          mix  
  <int> <date>     <chr>
1  9999 2046-05-17 9999 
2 10000 2046-05-18 10000
3     1 2019-01-01 1    
4     2 2019-01-02 2    
5     3 2019-01-03 3.1  
6     4 2019-01-04 4    
7     5 2019-01-05 5   


this is BAD! The 3.1 is now NA.

bind_rows(df_large, df)%>%write_csv(., '~/temp.csv')
read_csv('~/temp.csv')%>%tail(7)
# A tibble: 7 x 3
      i d            mix
  <int> <date>     <int>
1  9999 2046-05-17  9999
2 10000 2046-05-18 10000
3     1 2019-01-01     1
4     2 2019-01-02     2
5     3 2019-01-03    NA
6     4 2019-01-04     4
7     5 2019-01-05     5


this works, but how to i set guess_max ahead of time.

read_csv('~/temp.csv', guess_max = 1e5)%>%as.tbl%>%tail(7)
# A tibble: 7 x 3
      i d              mix
  <int> <date>       <dbl>
1  9999 2046-05-17  9999  
2 10000 2046-05-18 10000  
3     1 2019-01-01     1  
4     2 2019-01-02     2  
5     3 2019-01-03     3.1
6     4 2019-01-04     4  
7     5 2019-01-05     5  


As the guess_max grows, so does run time. It seems to be oversampling.

system.time(read_csv('~/temp.csv', guess_max = 1e5)%>%as.tbl%>%tail(7))
   user  system elapsed 
  0.020   0.001   0.022 
system.time(read_csv('~/temp.csv', guess_max = 1e7)%>%as.tbl%>%tail(7))
   user  system elapsed 
  0.321   0.010   0.332 
system.time(read_csv('~/temp.csv', guess_max = 1e9)%>%as.tbl%>%tail(7))
   user  system elapsed 
 34.138   5.848  39.821 


This works, but there can be >30 columns, and i don't types ahead of time.

read_csv('~/temp.csv', col_types = 'dDd')%>%as.tbl%>%tail(7)


data.table::fread is quick, handles numerics well, but doesn't convert dates.

data.table::fread('~/temp.csv')%>%as.tbl%>%tail(7)
# A tibble: 7 x 3
      i d              mix
  <int> <chr>        <dbl>
1  9999 2046-05-17  9999  
2 10000 2046-05-18 10000  
3     1 2019-01-01     1  
4     2 2019-01-02     2  
5     3 2019-01-03     3.1
6     4 2019-01-04     4  
7     5 2019-01-05     5 

Solution

  • One thing you could do is read in the first row of the CSV (or first n number of rows), find out which columns are being parsed as integers, and pass those as arguments to cols:

    library(readr)
    
    read_csv_dbl <- function(file, ...){
      types <- sapply(suppressMessages(read_csv(file, n_max = 1)), class) 
      int_cols <- names(types[types == "integer"])
      args <- structure(replicate(length(int_cols), col_double()), names = int_cols)
      read_csv(file, col_types = do.call(cols, args), ...)
    }
    
    read_csv_dbl("~/temp.csv") %>% tail(7)
    # A tibble: 7 x 3
    #      i d              mix
    #  <dbl> <date>       <dbl>
    #1  9999 2046-05-17  9999  
    #2 10000 2046-05-18 10000  
    #3     1 2019-01-01     1  
    #4     2 2019-01-02     2  
    #5     3 2019-01-03     3.1
    #6     4 2019-01-04     4  
    #7     5 2019-01-05     5  
    

    This method is also much faster than changing guess_max:

    system.time(read_csv_dbl("~/temp.csv"))
    #   user  system elapsed 
    #   0.02    0.00    0.01