
How to add double quotes after a certain position in a string in R


I have a data.table with many rows that look like this in R:

    V1        V2       V3    V4   V5  V6  V7  V8   V9           V10
 NCBINCC    GenBank   gene  331 1008  .   -   .   gene_id=UL1   protein_id=ABV71500.1
 NCBINCC    GenBank   gene  1009 1120  .  -   .  gene_id=UL4   protein_id=ABV71520
 NCBINCC    GenBank   gene  1135 1200  .  -   .  gene_id=UL6   protein_id=ABV71525

Is there a simple way to wrap the values that follow gene_id= and protein_id= in double quotes, so that the quotes enclose only the gene and protein identifiers, as in the following output:

    V1        V2       V3    V4   V5  V6  V7  V8   V9            V10
 NCBINCC    GenBank   gene  331 1008  .   -   .   gene_id="UL1"  protein_id="ABV71500.1"
 NCBINCC    GenBank   gene  1009 1120 .   -   .  gene_id="UL4"  protein_id="ABV71520"
 NCBINCC    GenBank   gene  1135 1200 .   -   .  gene_id="UL6"  protein_id="ABV71525"

I have seen this answer for shell, but I would like to know whether there is also a way to do it in R. Thank you kindly.


Solution

  • If you are tired of packages, you may want to try sub inside lapply. This base R approach works on a plain data.frame.

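    v <- c('V9', 'V10')  # columns holding the key=value strings
    # capture everything after the first "=" and put it back wrapped in double quotes
    d[v] <- lapply(d[v], sub, pattern='=(.*)', replacement='="\\1"')
    d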
    #        V1      V2   V3   V4   V5 V6 V7 V8            V9                     V10
    # 1 NCBINCC GenBank gene  331 1008  .  -  . gene_id="UL1" protein_id="ABV71500.1"
    # 2 NCBINCC GenBank gene 1009 1120  .  -  . gene_id="UL4"   protein_id="ABV71520"
    # 3 NCBINCC GenBank gene 1135 1200  .  -  . gene_id="UL6"   protein_id="ABV71525"
    

    Data

    d <- read.table(header=TRUE, text='V1        V2       V3    V4   V5  V6  V7  V8   V9           V10
    NCBINCC    GenBank   gene  331 1008  .   -   .   gene_id=UL1   protein_id=ABV71500.1
    NCBINCC    GenBank   gene  1009 1120  .  -   .  gene_id=UL4   protein_id=ABV71520
    NCBINCC    GenBank   gene  1135 1200  .  -   .  gene_id=UL6   protein_id=ABV71525')
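
Since the question mentions a data.table rather than a data.frame, the same substitution can probably be applied by reference with `:=` and `.SDcols`. The following is only a minimal sketch, assuming the data.table package is installed; the object `dt` and the vector `cols` are illustrative names that do not appear in the original answer, and `dt` holds only the two affected columns.

```r
library(data.table)

# Illustrative stand-in for the real table: only the two columns that need quoting
dt <- data.table(
  V9  = c("gene_id=UL1", "gene_id=UL4", "gene_id=UL6"),
  V10 = c("protein_id=ABV71500.1", "protein_id=ABV71520", "protein_id=ABV71525")
)

cols <- c("V9", "V10")  # hypothetical name for the columns to rewrite

# For each selected column, capture everything after the first "="
# and re-insert it wrapped in double quotes; := updates dt in place
dt[, (cols) := lapply(.SD, sub, pattern = "=(.*)", replacement = '="\\1"'),
   .SDcols = cols]

print(dt)
```

Because `:=` modifies the table by reference, no reassignment of `dt` is needed, which may matter when the table has many rows.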