I have runoff data with a lot of zero values and occasionally some non-zero double values.
'readr::read_csv' guesses integer column type because of the many zeros.
How can I make read_csv to guess the correct double column type? I do not know the mapping of the variable names beforehand, hence I cannot give name-type mapping.
Here is a small example
# create a column of doubles with many zeros (runoff data)
#dsTmp <- data.frame(x = c(rep(0.0, 2), 0.5)) # this works
dsTmp <- data.frame(x = c(rep(0.0, 1e5), 0.5))
write_csv(dsTmp, "tmp/dsTmp.csv")
# 0.0 is written as 0
# read_csv now guesses integer instead of double and reports
# a parsing failure.
ans <- read_csv("tmp/dsTmp.csv")
# the last value is NA instead of 0.5
tail(ans)
Can I tell it to choose a try wider column types instead of issuing a parsing failure?
Issue 645 mentions this problem, but the workaround given there is on the writing side. I have little influence on the writing side.
Here's two techniques. (Data prep at the bottom. $hp
and $vs
and beyond are the integer columns.)
NB: I add cols(.default=col_guess())
to most of the first-time calls so that we don't get the large message of what read_csv
found the columns to be. It can be omitted at the cost of a more noisy console.
Force all columns to be double, with the cols(.default=...)
setting, works safely as long as you know there are no non-numbers in the file:
read_csv("mtcars.csv", col_types = cols(.default = col_double()))
# Warning in rbind(names(probs), probs_f) :
# number of columns of result is not a multiple of vector length (arg 1)
# Warning: 32 parsing failures.
### ...snip...
# See problems(...) for more details.
# # A tibble: 32 x 11
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 NA 160 110 3.9 2.62 16.5 0 1 4 4
# 2 21 NA 160 110 3.9 2.88 17.0 0 1 4 4
# 3 22.8 NA 108 93 3.85 2.32 18.6 1 1 4 1
# 4 21.4 NA 258 110 3.08 3.22 19.4 1 0 3 1
# 5 18.7 NA 360 175 3.15 3.44 17.0 0 0 3 2
# 6 18.1 NA 225 105 2.76 3.46 20.2 1 0 3 1
# 7 14.3 NA 360 245 3.21 3.57 15.8 0 0 3 4
# 8 24.4 NA 147. 62 3.69 3.19 20 1 0 4 2
# 9 22.8 NA 141. 95 3.92 3.15 22.9 1 0 4 2
# 10 19.2 NA 168. 123 3.92 3.44 18.3 1 0 4 4
# # ... with 22 more rows
Change only <int>
(col_integer()
) columns, taking a little more care. My use of n_max=50
needs to be balanced. Similar to guess_max=
, a little more is better. In this case, if I chose n_max=1
then the top couple of mpg
values would suggest integers, which is fine. But if you have other fields that are ambiguous with other classes, you'll need more. Since you're talking about not wanting to read in the whole file but are willing to read in "a bit" to get the right guess, I'd think you can go with a reasonable value (100s? 1000s?) here to be robust for chr
and lgl
.
types <- attr(read_csv("mtcars.csv", n_max=1, col_types = cols(.default = col_guess())), "spec")
(intcols <- sapply(types$cols, identical, col_integer()))
# mpg cyl disp hp drat wt qsec vs am gear carb
# TRUE FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
types$cols[intcols] <- replicate(sum(intcols), col_double())
and the final read, noting that $hp
and beyond are now <dbl>
(unlike in the data prep read, below).
read_csv("mtcars.csv", col_types = types)
# # A tibble: 32 x 11
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 c6 160 110 3.9 2.62 16.5 0 1 4 4
# 2 21 c6 160 110 3.9 2.88 17.0 0 1 4 4
# 3 22.8 c4 108 93 3.85 2.32 18.6 1 1 4 1
# 4 21.4 c6 258 110 3.08 3.22 19.4 1 0 3 1
# 5 18.7 c8 360 175 3.15 3.44 17.0 0 0 3 2
# 6 18.1 c6 225 105 2.76 3.46 20.2 1 0 3 1
# 7 14.3 c8 360 245 3.21 3.57 15.8 0 0 3 4
# 8 24.4 c4 147. 62 3.69 3.19 20 1 0 4 2
# 9 22.8 c4 141. 95 3.92 3.15 22.9 1 0 4 2
# 10 19.2 c6 168. 123 3.92 3.44 18.3 1 0 4 4
# # ... with 22 more rows
Data:
library(readr)
mt <- mtcars
mt$cyl <- paste0("c", mt$cyl) # for fun
write_csv(mt, path = "mtcars.csv")
read_csv("mtcars.csv", col_types = cols(.default = col_guess()))
# # A tibble: 32 x 11
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <chr> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
# 1 21 c6 160 110 3.9 2.62 16.5 0 1 4 4
# 2 21 c6 160 110 3.9 2.88 17.0 0 1 4 4
# 3 22.8 c4 108 93 3.85 2.32 18.6 1 1 4 1
# 4 21.4 c6 258 110 3.08 3.22 19.4 1 0 3 1
# 5 18.7 c8 360 175 3.15 3.44 17.0 0 0 3 2
# 6 18.1 c6 225 105 2.76 3.46 20.2 1 0 3 1
# 7 14.3 c8 360 245 3.21 3.57 15.8 0 0 3 4
# 8 24.4 c4 147. 62 3.69 3.19 20 1 0 4 2
# 9 22.8 c4 141. 95 3.92 3.15 22.9 1 0 4 2
# 10 19.2 c6 168. 123 3.92 3.44 18.3 1 0 4 4
# # ... with 22 more rows