Search code examples
rescapingbackslash

How to replace \"" with \""" in R?


I have a tsv that has some fields with some escaped double quotation marks (DQM). Those are fine, but the problem comes when the DQM delimiter comes after one of those and then when the tsv is opened it doesn't recognize the last DQM and so the following field gets included with the previous.

For example, the line:

"9812" "tt0167609" "tvSeries" "L'homme du \"Picardie\"" "L'homme du \"Picardie\"" 0 "1968" "\N" "13" "Drama"

when opened in a SS, places the first 3 fields fine. But the fourth field it includes all the way up to 1968 when what it should do is only put in the first "L'homme du \"Picardie\", the second of the same in the next field, and so on. And so the problem appears to be that it's not recognizing the " after the \". I tried clicking different options when opening it in SS programs, but nothing fixes it.

Now I've found out that I can fix this before opening it in a SS program by replacing \"" with \""" in a text editor, but I'd like to be able to do it in R when the file is generated.

I've tried several combinations of strings, but I just can't figure it out and I'm hoping someone can point me in the right direction. The following are just some of what I tried.

tv.Subset <- str_replace(tv.Subset, "\\\"\"", "\\\"\"\"") - one of my first attempts, simply escape each character in the string

tv.Subset <- str_replace(tv.Subset, '\\\"\"', '\\\"\"\"') - I wondered if single quotation marks for delimiters might be the trick

tv.Subset <- str_replace(tv.Subset, "\\\\\\"\\"", "\\\\\\"\\"\\"") - I read that you need to do double backslashes to respect both R and regex

Thanks.


Solution

  • If I am understanding your question correctly, R is generating .tsv output files that you cannot open in spreadsheet programs due to the escaped double quotes in R's output.

    If so, here is a reproducible example where the third row is your original data:

    df <- data.frame(f1=c(1,27,9812,9999),
                     f2=c('tt12345','tt54321','tt0167609','tt98765'),
                     f3=rep("tvSeries",4),
                     f4=c("Some string","Another string","L'homme du \"Picardie\"","Yet another string"),
                     f5=c("Here is another string","Another string again","L'homme du \"Picardie\"","Also a string"),
                     f6=c(0,0,0,0),
                     f7=c(1966,1967,1968,1969),
                     f8=rep("\\N",4),
                     f9=c(11,12,13,14),
                     f10=rep('Drama',4))
    write.table(df,'outfile.tsv')
    

    Viewed in R, your data would look like this:

    > print(df,row.names=F)
       f1        f2       f3                    f4                     f5 f6   f7  f8 f9   f10
        1   tt12345 tvSeries           Some string Here is another string  0 1966 \\N 11 Drama
       27   tt54321 tvSeries        Another string   Another string again  0 1967 \\N 12 Drama
     9812 tt0167609 tvSeries L'homme du "Picardie"  L'homme du "Picardie"  0 1968 \\N 13 Drama
     9999   tt98765 tvSeries    Yet another string          Also a string  0 1969 \\N 14 Drama
    

    And the command write.table(df,'outfile.tsv') produces this output file, which some spreadsheet programs would fail to read:

    "f1" "f2" "f3" "f4" "f5" "f6" "f7" "f8" "f9" "f10"
    "1" 1 "tt12345" "tvSeries" "Some string" "Here is another string" 0 1966 "\N" 11 "Drama"
    "2" 27 "tt54321" "tvSeries" "Another string" "Another string again" 0 1967 "\N" 12 "Drama"
    "3" 9812 "tt0167609" "tvSeries" "L'homme du \"Picardie\"" "L'homme du \"Picardie\"" 0 1968 "\N" 13 "Drama"
    "4" 9999 "tt98765" "tvSeries" "Yet another string" "Also a string" 0 1969 "\N" 14 "Drama"
    

    The solution is not to change \" in R like you've been trying to, but instead to just not have R surround strings with quotes in your output. That way the only quotes in the output are the quotes that are part of a title, and spreadsheet programs will be able to treat them accordingly.

    Here is the command you can use to save your data without the escaped quotes: write.table(df,'outfile.tsv',quote=F,sep='\t'). This will do two things:

    1. quote=F will stop R from putting quotes around strings in your output. Since those quotes aren't there, it won't escape the quotes in titles.
    2. sep='\t' will make R separate output columns with a tab character rather than a space character (space is the default, and your example looks like it used space). You need sep='\t' when quote=F so spreadsheet programs can tell spaces within strings are still part of the string and not separators.

    Your new output file will look like this:

    f1  f2  f3  f4  f5  f6  f7  f8  f9  f10
    1   1   tt12345 tvSeries    Some string Here is another string  0   1966    \N  11  Drama
    2   27  tt54321 tvSeries    Another string  Another string again    0   1967    \N  12  Drama
    3   9812    tt0167609   tvSeries    L'homme du "Picardie"   L'homme du "Picardie"   0   1968    \N  13  Drama
    4   9999    tt98765 tvSeries    Yet another string  Also a string   0   1969    \N  14  Drama
    

    As long as the new output file uses tab characters, most spreadsheet programs will have no problem reading it like this.