I have a data file in weird format that I'm trying to import and wrangle into R. The file is like a CSV in that values are comma delimited. However, there are also text strings containing commas that shouldn't be interpreted as delimiters.
The problem is, there is no use of quotes for individual elements. In the proprietary software, text strings containing non-delimiting commas are bounded by special, non-ASCII characters, which I'm guessing is how that software determines which commas not to use as delimiters.
I've figured out how to replace the non-ASCII characters with ASCII characters, to make it easier to work with in R. But, I still haven't figured out how to correctly split the text.
My current approach is to replace all the non-delimiting commas with another symbol (say, a semicolon) and then use strsplit
for the commas. But, I haven't figured out how to replace the non-delimiting commas with semicolons.
For data privacy reasons, I'll create a dummy text string to work with here:
my_string <- "1,2,3,4,NONASCIIsome text, hereNONASCII,7,8,9,NONASCII,more, pesky, commas,NONASCII,10,11"
My desired output:
[1] "1" "2" "3" "4"
[5] "some text; here" "7" "8" "9"
[9] ";more; pesky; commas;" "10" "11"
If I run strsplit
as-is (after removing the "NONASCII"), I'd get something like this, which is not what I want:
[1] "1" "2" "3" "4" "some text" " here" "7" "8" "9"
[10] "" "more" " pesky" " commas" "" "10" "11"
I'm not sure if gsub
is the best function to use for this, but I've figured out how to at least match the problematic strings with regex:
my_string2 <- gsub("NONASCII(.*?)NONASCII", "\\1", my_string)
That's about as far as I've gotten. The code above only gets rid of the "NONASCII", but it doesn't replace the non-delimiting commas with semicolons
If there's another approach that would work too, I'm all ears!
You can replace 'NONASCII' with quotes, scan the data in, and replace the commas with semicolons:
my_string |>
gsub("NONASCII", "'", x = _) |>
scan(text = _, sep = ",", what = character(), quiet = TRUE) |>
gsub(",", ";", x = _)
[1] "1" "2" "3"
[4] "4" "some text; here" "7"
[7] "8" "9" ";more; pesky; commas;"
[10] "10" "11"