Given a ;
delimited file of structure:
colA; colB; colC
1;A; 10
2;B; 11
3;C"; 12
4;D""; 15
5;"F";20
6;K"""; 21
7;""M";22
8; \""O;23
I would like to ensure that colB
is always imported verbatim as a character string. In particular, I would like to preserve all values including ""M"
and \""O
.
I'm currently trying:
require(readr)
tst_dta <- read_delim(
file = "test_file.csv",
escape_double = FALSE,
delim = ";",
col_types = cols(
colA = col_integer(),
colB = col_character(),
colC = col_integer()
)
)
but this returns:
> tst_dta
# A tibble: 8 x 3
colA colB colC
<int> <chr> <int>
1 1 A 10
2 2 B NA
3 3 "C\"" 12
4 4 "D\"\"" 15
5 5 F 20
6 6 "K\"\"\"" 21
7 7 "\"\"M\"" 22
8 8 " \\\"\"O" 23
The desired results should reflect:
colA colB colC
<int> <chr> <int>
1 A 10
2 B 11
3 C" 12
4 D"" 15
5 "F" 20
6 K""" 21
7 ""M" 22
8 \""O 23
\""[Non-ASCII-Character]O
would appear in the resulting data frame as \""O
string.As per comments, more examples:
is:
colA; colB; colC
1; text \" text; 2
should be:
colA;colB;colC
1;text text;2
is:
colA; colB; colC
1; text \;" text; 2
should be:
colA;colB;colC
1;text text;2
is:
colA; colB; colC
1; [non-ASCII] text something \;" text; 2
should be:
colA;colB;colC
1;text something;2
If you need to use readr
-functions, then look at it's argument list and see if it has an equivalent to the quote
argument in read.table
(which allows simple access:
read.table(text=txt, header=TRUE, quote="", sep=";")
colA colB colC
1 1 A 10
2 2 B 11
3 3 C" 12
4 4 D"" 15
5 5 "F" 20
6 6 K""" 21
7 7 ""M" 22
8 8 ""O 23
Seems like it should succeed, since it's the third argument in readr::read_delim
. The default in both cases is "\""
which is a single double-quote. Set it to an empty character (""
):
read_delim(file, delim, quote = "\"", escape_backslash = FALSE,
escape_double = TRUE, col_names = TRUE, col_types = NULL,
locale = default_locale(), na = c("", "NA"), quoted_na = TRUE,
comment = "", trim_ws = FALSE, skip = 0, n_max = Inf,
guess_max = min(1000, n_max), progress = show_progress())
And this is the print representation of the result. I would note that this print representation seems bit irregular. Character values are enclosed in double quotes only if they have embedded double quotes, i.e \"
. On the other hand such columns are character which is a nice change from the default settings in read.table which give you factor columns:
read_delim(file=txt, quote="", delim=";")
# A tibble: 8 x 3
colA ` colB` ` colC`
<int> <chr> <chr>
1 1 A " 10"
2 2 B " 11 "
3 3 "C\"" " 12"
4 4 "D\"\"" " 15"
5 5 "\"F\"" 20
6 6 "K\"\"\"" " 21"
7 7 "\"\"M\"" 22
8 8 " \"\"O" 23
You are hereby warned that using this option with read_delim
does mean that neither column names nor values are trimmed to remove whitespace. And everything is character
, even the columns that would otherwise come in as character
. Notice the name of your second column. That does not occur with read.table
:
read_delim(file=txt, quote="", delim=";")$` colB` ==
read.table(text=txt, header=TRUE, quote="", sep=";")$colB
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Further gsub
-processing would be needed if you wanted leading or trailing whitespace to be removed. rm_non_ascii
in pkg {qdapRegex} can remove non-ASCII characters