CSV files that use the European number format (1234.56 is written as 1.234,56) should be handled by a readr function or fread(). Even though read_csv2() should be designed for exactly this task, it basically ignores the locale specification and only guesses the number formatting automatically. This is problematic if the first numbers with more than 3 digits appear only at the end of the file, i.e. after guess_max is reached (1000 by default).
How can I enforce the correct formatting programmatically?
library(readr)
data <- data.frame(var1 = c("", 4, 5, "124.392,45"),
var2 = c(1, 2, "4.783.194,43", 7))
write_csv2(data, "data.csv")
read_csv2("data.csv", guess_max = 2,
locale = locale(decimal_mark = ",", grouping_mark = "."))
# # A tibble: 4 x 2
# var1 var2
# <dbl> <dbl>
# 1 NA 1
# 2 4 2
# 3 5 NA
# 4 NA 7
read_csv2("data.csv", guess_max = 3,
locale = locale(decimal_mark = ",", grouping_mark = "."))
# # A tibble: 4 x 2
# var1 var2
# <dbl> <dbl>
# 1 NA 1
# 2 4 2
# 3 5 4783194.
# 4 NA 7
read_delim("data.csv", delim = ";", guess_max = 3,
locale = locale(decimal_mark = ",", grouping_mark = "."))
# # A tibble: 4 x 2
# var1 var2
# <dbl> <dbl>
# 1 NA 1
# 2 4 2
# 3 5 4783194.
# 4 NA 7
Setting col_types beforehand seems to help; in this case both columns are numeric. From the readr documentation: col_number() [n], numbers containing the grouping_mark.
result <- read_csv2("data.csv",
# guess_max = 2, not needed if col_types are specified
col_types = cols(var1 = col_number(),
var2 = col_number()),
locale = locale(decimal_mark = ",", grouping_mark = "."))
result
# # A tibble: 4 x 2
# var1 var2
# <dbl> <dbl>
# 1 NA 1
# 2 4 2
# 3 5 4783194.
# 4 124392. 7
As Adam pointed out, if you set col_types there is no need for guessing, because col_types then has to specify a type for each column you want to read in.
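If the file has many columns, naming each one in cols() gets tedious. A minimal sketch of one way around that, assuming every column should be parsed as a number: cols(.default = col_number()) applies col_number() to all columns, so nothing is left to guessing. The temporary file path and the variable name eu are only for illustration and are not part of the original question.

```r
library(readr)

# Recreate the example file from the question in a temporary location
# (the tempfile path is an assumption made for this sketch).
data <- data.frame(var1 = c("", 4, 5, "124.392,45"),
                   var2 = c(1, 2, "4.783.194,43", 7))
path <- tempfile(fileext = ".csv")
write_csv2(data, path)

# European locale: "," as decimal mark, "." as grouping mark.
eu <- locale(decimal_mark = ",", grouping_mark = ".")

# Parse every column with col_number(), without naming the columns.
result <- read_csv2(path,
                    col_types = cols(.default = col_number()),
                    locale = eu)
result

# The same locale also works on a single string with parse_number().
single <- parse_number("124.392,45", locale = eu)
single
```

This only fits files in which all columns really are numeric; for mixed files, the per-column specification shown above is the safer choice.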