
Extracting text based on condition in R


I am relatively new to R. I have a character variable named RN whose text needs to be extracted into 2 variables [named_RN and general_RN] based on some conditions on RN. This is what the desired result is (right now, named_RN and general_RN are blank - I don't know how to code this part and that's what I need help on!):

RN                                              named_RN         general_RN
RP4A60D26L (Pentazocine)                        Pentazocine
0 (Complement C4)                                                Complement C4
0 (Aminocap) U6206 (Amino)                      Amino            Aminocap
N3R30 (Amiodarone) 0 (Benzo) 0 (Ferri)          Amiodarone       Benzo, Ferri

As you can see, I am trying to extract the information within the parentheses. However, I want to extract from RN into general_RN if it has a code of 0 and extract into named_RN if it has a non-zero code.

The main problem I am running into is that I cannot simply gsub on "0 (" or " 0 (" (the second with a leading space, because the 0 code sometimes starts in the middle of the text in RN, as in the last row): some of the codes for named_RN themselves end with "0 (", as N3R30 does in the last row.

Please advise.

Thank you!


Solution

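  • Here's one way to do it. I first build a temporary column in which every opening parenthesis is tagged with its target variable (general_RN or named_RN), so the matches are easier to detect. Then I extract the contents of the parentheses with regmatches.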

    df <- read.table(text="RN
    'RP4A60D26L (Pentazocine)'
    '0 (Complement C4)'
    '0 (Aminocap) U6206 (Amino)'
    'N3R30 (Amiodarone) 0 (Benzo) 0 (Ferri)'",header=TRUE,stringsAsFactors=FALSE)
    
    df$RN_temp <- gsub("^[0] "," general_RN",df$RN) #replace leading 0s w/ general_RN
    df$RN_temp <- gsub(" [0] "," general_RN",df$RN_temp) #replace other " 0 "
    df$RN_temp <- gsub(" \\("," named_RN(",df$RN_temp) #replace rest w/ named_RN
    df$RN_temp
    
    df$named_RN <- regmatches(df$RN_temp,gregexpr("(?<=named_RN\\().*?(?=\\))",
                    df$RN_temp, perl=TRUE))
    df$general_RN <- regmatches(df$RN_temp,gregexpr("(?<=general_RN\\().*?(?=\\))", 
                      df$RN_temp, perl=TRUE))
    df$RN_temp <- NULL
    df
    
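    The anchoring is what keeps a code such as N3R30 from being mistaken for a 0 code. A small sketch of the difference, using two of the rows above (the vector names rn, naive and tagged are only illustrative):

```r
rn <- c("RP4A60D26L (Pentazocine)",
        "N3R30 (Amiodarone) 0 (Benzo) 0 (Ferri)")

# Unanchored pattern: also hits the "0 (" at the end of the code N3R30.
naive <- gsub("0 \\(", "general_RN(", rn)
naive[2]
# "N3R3general_RN(Amiodarone) general_RN(Benzo) general_RN(Ferri)"

# Anchored patterns: only a standalone 0 (at the start of the string,
# or preceded by a space) is tagged as general_RN.
tagged <- gsub("^0 ", " general_RN", rn)
tagged <- gsub(" 0 ", " general_RN", tagged)
tagged <- gsub(" \\(", " named_RN(", tagged)  # every remaining " (" is a named code
tagged[2]
# "N3R30 named_RN(Amiodarone) general_RN(Benzo) general_RN(Ferri)"
```

    Because the third substitution looks for a space before the parenthesis, it leaves the already tagged "general_RN(" entries alone.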

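    EDIT To turn the list columns into ordinary character columns, I collapse each row's matches into one comma-separated string. regmatches returns character(0) (not NULL) for a row with no match, so I test the length of each element to change those to NA. A plain unlist(df$general_RN) would flatten row 4's two matches into separate elements and misalign the rows.

    # Empty match -> NA; one or more matches -> "a, b"
    collapse_matches <- function(x) if (length(x) == 0) NA_character_ else paste(x, collapse = ", ")
    df$named_RN <- vapply(df$named_RN, collapse_matches, character(1))
    df$general_RN <- vapply(df$general_RN, collapse_matches, character(1))
    
    'data.frame':   4 obs. of  3 variables:
     $ RN        : chr  "RP4A60D26L (Pentazocine)" "0 (Complement C4)" "0 (Aminocap) U6206 (Amino)" "N3R30 (Amiodarone) 0 (Benzo) 0 (Ferri)"
     $ named_RN  : chr  "Pentazocine" NA "Amino" "Amiodarone"
     $ general_RN: chr  NA "Complement C4" "Aminocap" "Benzo, Ferri"
                                          RN    named_RN    general_RN
    1               RP4A60D26L (Pentazocine) Pentazocine          <NA>
    2                      0 (Complement C4)        <NA> Complement C4
    3             0 (Aminocap) U6206 (Amino)       Amino      Aminocap
    4 N3R30 (Amiodarone) 0 (Benzo) 0 (Ferri)  Amiodarone  Benzo, Ferri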
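
    As an alternative sketch (not the approach used above), the temporary column can be skipped by pulling out each "code (name)" entry whole and then sorting the entries by whether the code is exactly "0". This assumes every entry has the form of a space-free code, one space, then a parenthesised name without nested parentheses; the helper pick_names and the variable names are hypothetical.

```r
rn <- c("RP4A60D26L (Pentazocine)",
        "0 (Complement C4)",
        "0 (Aminocap) U6206 (Amino)",
        "N3R30 (Amiodarone) 0 (Benzo) 0 (Ferri)")

# One list element per row, holding every "code (name)" entry in that row.
entries <- regmatches(rn, gregexpr("[^ ]+ \\([^)]*\\)", rn))

# Hypothetical helper: keep the names whose code is (or is not) exactly "0".
pick_names <- function(e, want_zero) {
  code <- sub(" \\(.*$", "", e)                 # text before " ("
  name <- sub("^[^ ]+ \\((.*)\\)$", "\\1", e)   # text inside the parentheses
  keep <- if (want_zero) code == "0" else code != "0"
  if (!any(keep)) NA_character_ else paste(name[keep], collapse = ", ")
}

named_RN   <- vapply(entries, pick_names, character(1), want_zero = FALSE)
general_RN <- vapply(entries, pick_names, character(1), want_zero = TRUE)
data.frame(RN = rn, named_RN, general_RN, stringsAsFactors = FALSE)
```

    Comparing the whole code against "0" means a code that merely ends in 0, such as N3R30, is never treated as a 0 code.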