How to force readr to use the correct decimal/grouping mark?


CSV files in the European number format (1234.56 written as 1.234,56) should be handled by a readr function or fread(). Although read_csv2() is designed for exactly this task, it appears to ignore the locale specification and instead guesses each column's type from the first rows. This is a problem when the first numbers with more than three digits appear only near the end of the file, i.e. after guess_max rows (1000 by default) have been read.

How can I enforce the correct formatting programmatically?

library(readr)

data <- data.frame(var1 = c("", 4, 5, "124.392,45"),
                   var2 = c(1, 2, "4.783.194,43", 7))
write_csv2(data, "data.csv")

read_csv2("data.csv", guess_max = 2, 
          locale = locale(decimal_mark = ",", grouping_mark = "."))
# # A tibble: 4 x 2
#   var1  var2
#   <dbl> <dbl>
# 1    NA     1
# 2     4     2
# 3     5    NA
# 4    NA     7

read_csv2("data.csv", guess_max = 3, 
          locale = locale(decimal_mark = ",", grouping_mark = "."))
# # A tibble: 4 x 2
#   var1  var2
#   <dbl> <dbl>
# 1    NA     1
# 2     4     2
# 3     5    4783194.
# 4    NA     7

read_delim("data.csv", delim = ";", guess_max = 3,
           locale = locale(decimal_mark = ",", grouping_mark = "."))
# # A tibble: 4 x 2
#   var1  var2
#   <dbl> <dbl>
# 1    NA     1
# 2     4     2
# 3     5    4783194.
# 4    NA     7

Solution

  • Setting col_types explicitly seems to help. In this case both columns are declared as numbers with col_number(), which the readr documentation describes as:

    col_number() [n], numbers containing the grouping_mark

    result <- read_csv2("data.csv", 
              # guess_max = 2, not needed if col_types are specified
              col_types = cols(var1 = col_number(),
                               var2 = col_number()),
              locale = locale(decimal_mark = ",", grouping_mark = "."))
    
    result
    # A tibble: 4 x 2
         var1     var2
        <dbl>    <dbl>
    1     NA        1 
    2      4        2 
    3      5  4783194.
    4 124392.       7 
    

    As Adam pointed out, once col_types is set no guessing is needed, because col_types has to specify a type for every column you want to read in.
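
    A minimal sketch of a variation on the approach above, assuming every column in the file is numeric: cols(.default = col_number()) declares all columns as numbers without naming each one. The sample file and the names path, eu and single are made up for illustration. parse_number() is included only as a quick way to check the locale settings on a single value.

```r
library(readr)

# Hypothetical sample file in European number format; the large values
# appear only in the last rows, where type guessing could miss them.
path <- tempfile(fileext = ".csv")
writeLines(c("var1;var2",
             ";1",
             "4;2",
             "5;4.783.194,43",
             "124.392,45;7"), path)

# Locale with "," as decimal mark and "." as grouping mark.
eu <- locale(decimal_mark = ",", grouping_mark = ".")

# Declare every column as a number, so no type guessing takes place.
result <- read_csv2(path,
                    col_types = cols(.default = col_number()),
                    locale = eu)

# parse_number() applies the same locale rules to a single string.
single <- parse_number("124.392,45", locale = eu)
```

    If the file also contains non-numeric columns, naming the columns individually as in the answer above is probably the safer choice, since col_number() would drop any non-numeric characters around the numbers it finds.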