I am relatively new to R. I have a character variable named RN whose text needs to be extracted into 2 variables [named_RN and general_RN] based on some conditions on RN. This is the desired result (right now, named_RN and general_RN are blank; I don't know how to code this part, and that's what I need help with!):
RN                                       named_RN     general_RN
RP4A60D26L (Pentazocine)                 Pentazocine
0 (Complement C4)                                     Complement C4
0 (Aminocap) U6206 (Amino)               Amino        Aminocap
N3R30 (Amiodarone) 0 (Benzo) 0 (Ferri)   Amiodarone   Benzo, Ferri
As you can see, I am trying to extract the information within the parentheses. However, the text should go from RN into general_RN if it has a code of 0, and into named_RN if it has a non-zero code.
The main problem I am running into is that I cannot gsub by "0 (" or " 0 (" [with a space before the 0 in the latter, because sometimes the 0 code starts in the middle of the text in RN, as in the last row], because some of the codes for named_RN end with "0 (", as is also the case in the last row.
Please advise.
Thank you!
Here's one way to do it. I first create a temporary column in which each opening parenthesis is tagged with its target column name, so the matches are easier to detect. Then I extract the text inside the parentheses with regmatches.
df <- read.table(text="RN
'RP4A60D26L (Pentazocine)'
'0 (Complement C4)'
'0 (Aminocap) U6206 (Amino)'
'N3R30 (Amiodarone) 0 (Benzo) 0 (Ferri)'",header=TRUE,stringsAsFactors=FALSE)
df$RN_temp <- gsub("^0 ", " general_RN", df$RN)        # tag a 0 code at the start of the string
df$RN_temp <- gsub(" 0 ", " general_RN", df$RN_temp)   # tag a standalone 0 code later in the string
df$RN_temp <- gsub(" \\(", " named_RN(", df$RN_temp)   # every remaining " (" follows a non-zero code
df$RN_temp
df$named_RN <- regmatches(df$RN_temp,gregexpr("(?<=named_RN\\().*?(?=\\))",
df$RN_temp, perl=TRUE))
df$general_RN <- regmatches(df$RN_temp,gregexpr("(?<=general_RN\\().*?(?=\\))",
df$RN_temp, perl=TRUE))
df$RN_temp <- NULL
df
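As an alternative to the temporary column, the same idea can be sketched by splitting each string into "code (name)" pairs and routing each name by whether its code is exactly 0. This is only a minimal sketch: split_rn and pick are hypothetical helper names (not base R functions), and it assumes every entry has the form code, space, parenthesised name, with no ")" inside a name.

```r
# Hypothetical helper (an illustration, not a base R function): split each RN
# string into "code (name)" pairs and route the names by code.
split_rn <- function(rn) {
  # Each pair: a run of non-space characters, one space, then "(...)".
  pairs <- regmatches(rn, gregexpr("\\S+ \\([^)]*\\)", rn, perl = TRUE))
  pick <- function(p, want_zero) {
    code <- sub(" \\(.*$", "", p)                # text before " ("
    name <- sub("^\\S+ \\((.*)\\)$", "\\1", p)   # text inside the parentheses
    keep <- name[(code == "0") == want_zero]
    # Assumption: rows with nothing to keep become NA; several names are
    # collapsed into one comma-separated string.
    if (length(keep) == 0) NA_character_ else paste(keep, collapse = ", ")
  }
  data.frame(
    RN = rn,
    named_RN = vapply(pairs, pick, character(1), want_zero = FALSE),
    general_RN = vapply(pairs, pick, character(1), want_zero = TRUE),
    stringsAsFactors = FALSE
  )
}

rn <- c("RP4A60D26L (Pentazocine)", "0 (Complement C4)",
        "0 (Aminocap) U6206 (Amino)", "N3R30 (Amiodarone) 0 (Benzo) 0 (Ferri)")
res <- split_rn(rn)
res
```

Because the code is compared as a whole string (code == "0"), a non-zero code that merely ends in 0, such as N3R30, is not mistaken for a 0 code.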
EDIT
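To turn the list columns into ordinary character columns of the data.frame, I collapse each row's matches into a single string. Rows with no match come back from regmatches as character(0) (not NULL), so those are mapped to NA. A plain unlist() is not safe for general_RN here: it drops the empty first row and splits the two matches of the last row, so the values land on the wrong rows.
# one string per row: NA if nothing matched, otherwise matches joined by ", "
to_chr <- function(x) if (length(x)) paste(x, collapse = ", ") else NA_character_
df$named_RN <- vapply(df$named_RN, to_chr, character(1))
df$general_RN <- vapply(df$general_RN, to_chr, character(1))
'data.frame': 4 obs. of 3 variables:
 $ RN        : chr "RP4A60D26L (Pentazocine)" "0 (Complement C4)" "0 (Aminocap) U6206 (Amino)" "N3R30 (Amiodarone) 0 (Benzo) 0 (Ferri)"
 $ named_RN  : chr "Pentazocine" NA "Amino" "Amiodarone"
 $ general_RN: chr NA "Complement C4" "Aminocap" "Benzo, Ferri"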
                                      RN    named_RN    general_RN
1               RP4A60D26L (Pentazocine) Pentazocine
2                      0 (Complement C4)             Complement C4
3             0 (Aminocap) U6206 (Amino)       Amino      Aminocap
4 N3R30 (Amiodarone) 0 (Benzo) 0 (Ferri)  Amiodarone  Benzo, Ferri