Search code examples
rtidyversereadr

How to tell readr::read_csv to guess double column correctly


I have runoff data with a lot of zero values and occasionally some non-zero double values.

'readr::read_csv' guesses integer column type because of the many zeros.

How can I make read_csv to guess the correct double column type? I do not know the mapping of the variable names beforehand, hence I cannot give name-type mapping.

Here is a small example

  # create a column of doubles with many zeros (runoff data)
  #dsTmp <- data.frame(x = c(rep(0.0, 2), 0.5)) # this works
  dsTmp <- data.frame(x = c(rep(0.0, 1e5), 0.5))
  write_csv(dsTmp, "tmp/dsTmp.csv")
  # 0.0 is written as 0 
  # read_csv now guesses integer instead of double and reports 
  # a parsing failure. 
  ans <- read_csv("tmp/dsTmp.csv")
  # the last value is NA instead of 0.5
  tail(ans)

Can I tell it to choose a try wider column types instead of issuing a parsing failure?

Issue 645 mentions this problem, but the workaround given there is on the writing side. I have little influence on the writing side.


Solution

  • Here's two techniques. (Data prep at the bottom. $hp and $vs and beyond are the integer columns.)

    NB: I add cols(.default=col_guess()) to most of the first-time calls so that we don't get the large message of what read_csv found the columns to be. It can be omitted at the cost of a more noisy console.

    1. Force all columns to be double, with the cols(.default=...) setting, works safely as long as you know there are no non-numbers in the file:

      read_csv("mtcars.csv", col_types = cols(.default = col_double()))
      # Warning in rbind(names(probs), probs_f) :
      #   number of columns of result is not a multiple of vector length (arg 1)
      # Warning: 32 parsing failures.
      ### ...snip...
      # See problems(...) for more details.
      # # A tibble: 32 x 11
      #      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
      #    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
      #  1  21      NA  160    110  3.9   2.62  16.5     0     1     4     4
      #  2  21      NA  160    110  3.9   2.88  17.0     0     1     4     4
      #  3  22.8    NA  108     93  3.85  2.32  18.6     1     1     4     1
      #  4  21.4    NA  258    110  3.08  3.22  19.4     1     0     3     1
      #  5  18.7    NA  360    175  3.15  3.44  17.0     0     0     3     2
      #  6  18.1    NA  225    105  2.76  3.46  20.2     1     0     3     1
      #  7  14.3    NA  360    245  3.21  3.57  15.8     0     0     3     4
      #  8  24.4    NA  147.    62  3.69  3.19  20       1     0     4     2
      #  9  22.8    NA  141.    95  3.92  3.15  22.9     1     0     4     2
      # 10  19.2    NA  168.   123  3.92  3.44  18.3     1     0     4     4
      # # ... with 22 more rows
      
    2. Change only <int> (col_integer()) columns, taking a little more care. My use of n_max=50 needs to be balanced. Similar to guess_max=, a little more is better. In this case, if I chose n_max=1 then the top couple of mpg values would suggest integers, which is fine. But if you have other fields that are ambiguous with other classes, you'll need more. Since you're talking about not wanting to read in the whole file but are willing to read in "a bit" to get the right guess, I'd think you can go with a reasonable value (100s? 1000s?) here to be robust for chr and lgl.

      types <- attr(read_csv("mtcars.csv", n_max=1, col_types = cols(.default = col_guess())), "spec")
      (intcols <- sapply(types$cols, identical, col_integer()))
      #   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb 
      #  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE 
      types$cols[intcols] <- replicate(sum(intcols), col_double())
      

      and the final read, noting that $hp and beyond are now <dbl> (unlike in the data prep read, below).

      read_csv("mtcars.csv", col_types = types)
      # # A tibble: 32 x 11
      #      mpg cyl    disp    hp  drat    wt  qsec    vs    am  gear  carb
      #    <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
      #  1  21   c6     160    110  3.9   2.62  16.5     0     1     4     4
      #  2  21   c6     160    110  3.9   2.88  17.0     0     1     4     4
      #  3  22.8 c4     108     93  3.85  2.32  18.6     1     1     4     1
      #  4  21.4 c6     258    110  3.08  3.22  19.4     1     0     3     1
      #  5  18.7 c8     360    175  3.15  3.44  17.0     0     0     3     2
      #  6  18.1 c6     225    105  2.76  3.46  20.2     1     0     3     1
      #  7  14.3 c8     360    245  3.21  3.57  15.8     0     0     3     4
      #  8  24.4 c4     147.    62  3.69  3.19  20       1     0     4     2
      #  9  22.8 c4     141.    95  3.92  3.15  22.9     1     0     4     2
      # 10  19.2 c6     168.   123  3.92  3.44  18.3     1     0     4     4
      # # ... with 22 more rows
      

    Data:

    library(readr)
    mt <- mtcars
    mt$cyl <- paste0("c", mt$cyl) # for fun
    write_csv(mt, path = "mtcars.csv")
    read_csv("mtcars.csv", col_types = cols(.default = col_guess()))
    # # A tibble: 32 x 11
    #      mpg cyl    disp    hp  drat    wt  qsec    vs    am  gear  carb
    #    <dbl> <chr> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
    #  1  21   c6     160    110  3.9   2.62  16.5     0     1     4     4
    #  2  21   c6     160    110  3.9   2.88  17.0     0     1     4     4
    #  3  22.8 c4     108     93  3.85  2.32  18.6     1     1     4     1
    #  4  21.4 c6     258    110  3.08  3.22  19.4     1     0     3     1
    #  5  18.7 c8     360    175  3.15  3.44  17.0     0     0     3     2
    #  6  18.1 c6     225    105  2.76  3.46  20.2     1     0     3     1
    #  7  14.3 c8     360    245  3.21  3.57  15.8     0     0     3     4
    #  8  24.4 c4     147.    62  3.69  3.19  20       1     0     4     2
    #  9  22.8 c4     141.    95  3.92  3.15  22.9     1     0     4     2
    # 10  19.2 c6     168.   123  3.92  3.44  18.3     1     0     4     4
    # # ... with 22 more rows