Replace part of a character string in a data frame by condition in R


I have a data frame like this:

df = read.table(text="REF   Alt S00001  S00002  S00003  S00004  S00005
 TAAGAAG    TAAG    TAAGAAG/TAAGAAG TAAGAAG/TAAG    TAAG/TAAG   TAAGAAG/TAAGAAG TAAGAAG/TAAGAAG
 T  TG  T/T -/- TG/TG   T/T T/T
 CAAAA  CAAA    CAAAA/CAAAA CAAAA/CAAA  CAAAA/CAAAA -/- CAAAA/CAAAA
 TTGT   TTGTGT  TTGT/TTGT   TTGT/TTGT   TTGT/TTGT   TTGTGT/TTGTGT   TTGT/TTGTGT
 GTTT   GTTTTT  GTTT/GTTTTT GTTT/GTTT   GTTT/GTTT   GTTT/GTTT   GTTTTT/GTTTTT", header=TRUE, stringsAsFactors=FALSE)

I would like to replace each element separated by "/" in the sample columns with either "D" or "I", depending on the lengths of the strings in columns "REF" and "Alt". An element that matches the longer of the two should be replaced by "I"; one that matches the shorter should be replaced by "D". Elements that are "-" should stay unchanged. The expected result is:

REF Alt S00001  S00002  S00003  S00004  S00005
TAAGAAG TAAG    I/I I/D D/D I/I I/I
T   TG  D/D -/- I/I D/D D/D
CAAAA   CAAA    I/I I/D I/I -/- I/I
TTGT    TTGTGT  D/D D/D D/D I/I D/I
GTTT    GTTTTT  D/I D/D D/D D/D I/I

Solution

  • Here is one approach. I used the stringi package because its replacement functions are vectorized over both the strings to search and the patterns.

    First, establish which string in each row is shorter and which is longer:

    short <- ifelse(nchar(df$Alt) > nchar(df$REF), df$REF, df$Alt)
    long <- ifelse(nchar(df$REF) > nchar(df$Alt), df$REF, df$Alt)
    

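    Use those and loop over your columns, assigning a replacement as appropriate. Replace against the long patterns first, because the short string can be a substring of the long one (for example, "TAAG" occurs inside "TAAGAAG"), so replacing the short patterns first would also alter the long alleles: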

    library(stringi)
    
    df[,!(names(df) %in% c("REF", "Alt"))] <- # assign into original df
      lapply(1:(ncol(df) - 2), # - 2 because there are two columns we don't use
        function(ii) stri_replace_all_fixed(df[ ,ii + 2], long, "I")) # + 2 to skip first 2 columns
    
    df[,!(names(df) %in% c("REF", "Alt"))] <- 
      lapply(1:(ncol(df) - 2),
        function(ii) stri_replace_all_fixed(df[ ,ii + 2], short, "D"))
    
    #      REF    Alt S00001 S00002 S00003 S00004 S00005
    #1 TAAGAAG   TAAG    I/I    I/D    D/D    I/I    I/I
    #2       T     TG    D/D    -/-    I/I    D/D    D/D
    #3   CAAAA   CAAA    I/I    I/D    I/I    -/-    I/I
    #4    TTGT TTGTGT    D/D    D/D    D/D    I/I    D/I
    #5    GTTT GTTTTT    D/I    D/D    D/D    D/D    I/I
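
The two steps above can also be sketched without stringi. This is a rough base-R alternative, not the answer's own method. The helper name `recode_indels` is hypothetical, the toy data is a three-row, two-sample subset of the question's table, and the sketch assumes REF and Alt never have the same length. Instead of substring replacement, it splits each cell on "/" and compares whole alleles, so the order of the long and short replacements should no longer matter.

```r
# Hypothetical helper (not from the original answer): recode each allele in the
# sample columns as "I" (equals the longer of REF/Alt) or "D" (equals the shorter).
recode_indels <- function(df) {
  # Assumption: REF and Alt never have the same length, as in the question's data.
  stopifnot(all(nchar(df$REF) != nchar(df$Alt)))

  ref_longer <- nchar(df$REF) > nchar(df$Alt)
  long  <- ifelse(ref_longer, df$REF, df$Alt)
  short <- ifelse(ref_longer, df$Alt, df$REF)

  sample_cols <- setdiff(names(df), c("REF", "Alt"))
  df[sample_cols] <- lapply(df[sample_cols], function(column) {
    # Split "A/B" into its alleles, giving one character vector per row.
    alleles <- strsplit(column, "/", fixed = TRUE)
    # mapply walks the rows, pairing each split cell with that row's long/short.
    mapply(function(a, l, s) {
      a[a == l] <- "I"
      a[a == s] <- "D"  # anything else, such as "-", is left as is
      paste(a, collapse = "/")
    }, alleles, long, short, USE.NAMES = FALSE)
  })
  df
}

# Rows 1, 2 and 4 and samples S00002 and S00005 of the question's data.
toy <- data.frame(
  REF    = c("TAAGAAG", "T", "TTGT"),
  Alt    = c("TAAG", "TG", "TTGTGT"),
  S00002 = c("TAAGAAG/TAAG", "-/-", "TTGT/TTGT"),
  S00005 = c("TAAGAAG/TAAGAAG", "T/T", "TTGT/TTGTGT"),
  stringsAsFactors = FALSE
)

result <- recode_indels(toy)
print(result)
# S00002 becomes "I/D", "-/-", "D/D"; S00005 becomes "I/I", "D/D", "D/I"
```

Because the comparison is on whole alleles, a cell holding anything other than the row's REF or Alt string passes through unchanged, which is how "-" survives.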