Search code examples
rreadr

Escaping strings with readr::read_csv


I have a CSV file where numbers are escaped in strings and the decimal mark is a comma. I am reasonably sure that I have used readr::read_cvs to parse such files before, although it is a few years back, but I am running into trouble now. And it mostly looks like "..." isn't considered a quoted string.

If I do something like this:

> # This will go okay, except for an unecessary warning
> read_csv(
+     'A, B, C,    D,      E
+     $1, a, a,  1.2%,  "1,100,200"
+     $2, b, b,  2.1%,  "  140,000"
+     $3, c, c, 13.0%,  "2,005,000"',
+     col_types = "nccnn"
+ )
# A tibble: 3 × 5                                                                                                                       
      A B     C         D       E
  <dbl> <chr> <chr> <dbl>   <dbl>
1     1 a     a       1.2 1100200
2     2 b     b       2.1  140000
3     3 c     c      13   2005000
Warning message:
One or more parsing issues, see `problems()` for details 

I can parse the data. The warning I get is that the number of columns in lines 1 to 3 are wrong (and they are if the commas in the string are counted as field separators, but they are not otherwise). Even though read_csv() thinks the number of columns are wrong, it does parse the numbers in the final column correct.

If I have a string where "," is the decimal mark and "." the grouping mark, then I lose the final column completely:

> # This will go all wrong, because the locale is not respected if ',' is a separator inside a string
> read_csv(
+     'A, B, C,    D,      E
+     $1, a, a,  " 1,2%",  "1.100.200"
+     $2, b, b,  " 2,1%",  "  140.000"
+     $3, c, c, "13,0%",  "2.005.000"',
+     locale=locale(decimal_mark = ",", grouping_mark = "."),
+     col_types = "nccnn")
# A tibble: 3 × 5                                                                                                                       
      A B     C         D     E
  <dbl> <chr> <chr> <dbl> <dbl>
1     1 a     a         1     2
2     2 b     b         2     1
3     3 c     c        13     0
Warning message:
One or more parsing issues, see `problems()` for details 

The warning is the same as before, but what happens is that the comma in the percentages is interpreted as field separators--as if they were not quoted in a string--and the two two parts of the percentages are split into two columns (and the last column is tossed out completely).

Shouldn't the strings quote the commas so they are not interpreted as field separators? If not, is there any work around that doesn't involve changing the input files?


Solution

  • The problem seems to be the spaces between the commands and the start of the quotes. Does your actual data have those spaces or did you just but them there to try to make things look pretty for the example? This works fine:

    library(readr)
    read_csv(
         'A,B,C,D,E
         $1,a,a," 1,2%","1.100.200"
         $2,b,b," 2,1%","  140.000"
         $3,c,c,"13,0%","2.005.000"',
         locale=locale(decimal_mark = ",", grouping_mark = "."),
         col_types = "nccnn")
    #       A B     C         D       E
    #   <dbl> <chr> <chr> <dbl>   <dbl>
    # 1     1 a     a       1.2 1100200
    # 2     2 b     b       2.1  140000
    # 3     3 c     c      13   2005000