Search code examples
rcsvimportspecial-charactersreadr

Forcing read_delim in readr to treat multiple " and \ as part of column string


Given a ; delimited file of structure:

colA; colB; colC
1;A; 10
2;B; 11     
3;C"; 12
4;D""; 15
5;"F";20
6;K"""; 21
7;""M";22
8; \""O;23

I would like to ensure that colB is always imported verbatim as a character string. In particular, I would like to preserve all values including ""M" and \""O.

Attempt

I'm currently trying:

require(readr)
tst_dta <- read_delim(
  file = "test_file.csv",
  escape_double = FALSE,
  delim = ";",
  col_types = cols(
    colA = col_integer(),
    colB = col_character(),
    colC = col_integer()
  )
)

but this returns:

> tst_dta
# A tibble: 8 x 3
   colA colB        colC
  <int> <chr>      <int>
1     1 A             10
2     2 B             NA
3     3 "C\""         12
4     4 "D\"\""       15
5     5 F             20
6     6 "K\"\"\""     21
7     7 "\"\"M\""     22
8     8 " \\\"\"O"    23

Desired rsults

The desired results should reflect:

    colA colB  colC
   <int> <chr> <int>
    1     A     10
    2     B     11     
    3     C"    12
    4     D""   15
    5    "F"    20
    6     K"""  21
    7   ""M"    22
    8  \""O     23

Other points:

  • Ideally, I would also like to ensure that non-ASCII characters are ignored in a manner that value \""[Non-ASCII-Character]Owould appear in the resulting data frame as \""O string.

Updates

As per comments, more examples:

is:

colA; colB; colC
1; text \" text; 2

should be:

colA;colB;colC
1;text text;2

is:

colA; colB; colC
1; text \;" text; 2

should be:

colA;colB;colC
1;text text;2

is:

colA; colB; colC
1; [non-ASCII] text something \;" text; 2

should be:

colA;colB;colC
1;text something;2

Solution

  • If you need to use readr-functions, then look at it's argument list and see if it has an equivalent to the quote argument in read.table (which allows simple access:

     read.table(text=txt, header=TRUE, quote="", sep=";")
      colA colB colC
    1    1    A   10
    2    2    B   11
    3    3   C"   12
    4    4  D""   15
    5    5  "F"   20
    6    6 K"""   21
    7    7 ""M"   22
    8    8  ""O   23
    

    Seems like it should succeed, since it's the third argument in readr::read_delim. The default in both cases is "\"" which is a single double-quote. Set it to an empty character (""):

    Usage

    read_delim(file, delim, quote = "\"", escape_backslash = FALSE,
      escape_double = TRUE, col_names = TRUE, col_types = NULL,
      locale = default_locale(), na = c("", "NA"), quoted_na = TRUE,
      comment = "", trim_ws = FALSE, skip = 0, n_max = Inf,
      guess_max = min(1000, n_max), progress = show_progress())
    

    And this is the print representation of the result. I would note that this print representation seems bit irregular. Character values are enclosed in double quotes only if they have embedded double quotes, i.e \". On the other hand such columns are character which is a nice change from the default settings in read.table which give you factor columns:

    read_delim(file=txt,  quote="", delim=";")
    # A tibble: 8 x 3
       colA ` colB`   ` colC`   
      <int> <chr>     <chr>     
    1     1 A         " 10"     
    2     2 B         " 11     "
    3     3 "C\""     " 12"     
    4     4 "D\"\""   " 15"     
    5     5 "\"F\""   20        
    6     6 "K\"\"\"" " 21"     
    7     7 "\"\"M\"" 22        
    8     8 " \"\"O"  23 
    

    You are hereby warned that using this option with read_delim does mean that neither column names nor values are trimmed to remove whitespace. And everything is character, even the columns that would otherwise come in as character. Notice the name of your second column. That does not occur with read.table:

    read_delim(file=txt,  quote="", delim=";")$` colB` == 
             read.table(text=txt, header=TRUE, quote="", sep=";")$colB
    [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
    

    Further gsub-processing would be needed if you wanted leading or trailing whitespace to be removed. rm_non_ascii in pkg {qdapRegex} can remove non-ASCII characters