I have a single long string variable with 3 observations. I am trying to create a field, prob, that extracts a specific substring from each long string. The code and the warning message are below.
data aa: "The probability of being a carrier is 0.0002422359 " " an BRCA1 carrier 0.0001061067 " " an BRCA2 carrier 0.00013612 "
aa$prob <- ifelse(grepl("The probability of being a carrier is", xx)==TRUE, word(aa, 8, 8), ifelse(grepl("BRCA", xx)==TRUE, word(aa, 5, 5), NA))
Warning message: In aa$prob <- ifelse(grepl("The probability of being a carrier is", : Coercing LHS to a list
Here is my previous answer, updated to work with a data.frame.
library(dplyr)
aa <- data.frame(aa = c("...", "...", "The probability of being a carrier is 0.0002422359 ", " an BRCA1 carrier 0.0001061067 ", " an BRCA2 carrier 0.00013612 ", "..."))
aa %>%
  mutate(prob = as.numeric(if_else(grepl("(probability|BRCA[12] carrier)", aa),
                                   gsub("^.*?\\b([0-9]+\\.?[0-9]*)\\s*$", "\\1", aa), NA_character_)))
#                                                    aa         prob
# 1                                                 ...           NA
# 2                                                 ...           NA
# 3 The probability of being a carrier is 0.0002422359  0.0002422359
# 4                      an BRCA1 carrier 0.0001061067  0.0001061067
# 5                        an BRCA2 carrier 0.00013612  0.0001361200
# 6                                                 ...           NA
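The warning in the question appears because `aa` is a plain character vector there, and assigning with `$<-` to an atomic vector coerces it to a list. As a minimal sketch, the same extraction can be done in base R once the strings sit in a data.frame. The name `df` and the non-matching row are assumptions added for illustration; the other strings and both patterns are taken from above.

```r
# Sample strings from the question, plus one hypothetical non-matching row
aa <- c("The probability of being a carrier is 0.0002422359 ",
        " an BRCA1 carrier 0.0001061067 ",
        " an BRCA2 carrier 0.00013612 ",
        "some other text")

# `aa$prob <- ...` on this character vector would give
# "Coercing LHS to a list"; a data.frame column avoids that.
df <- data.frame(aa = aa, stringsAsFactors = FALSE)

# Rows that mention a probability or a BRCA1/BRCA2 carrier
matched <- grepl("(probability|BRCA[12] carrier)", df$aa)

# Start with NA everywhere, then fill in the matching rows only
df$prob <- NA_real_
df$prob[matched] <- as.numeric(
  sub("^.*?\\b([0-9]+\\.?[0-9]*)\\s*$", "\\1", df$aa[matched])
)

print(df)
```

Filling only the matched rows keeps `as.numeric()` away from strings that contain no number, so no "NAs introduced by coercion" warning is raised.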
Regex walk-through:
- ^ and $ are the beginning and end of the string, respectively; \\b is a word boundary. None of these "consume" any characters; they just mark beginnings and endings.
- . means one character.
- ? means "zero or one", aka optional; * means "zero or more"; + means "one or more". All refer to the previous character/class/group.
- \\s is blank space, including spaces and tabs.
- [0-9] is a class, meaning any character between 0 and 9; similarly, [a-z] is all lowercase letters, [a-zA-Z] is all letters, [0-9A-F] is the hexadecimal digits, etc.
- (...) is a saved group; it's not uncommon in a group to use | as an "or". This group is used later in the replacement= part of gsub as numbered groups, so \\1 recalls the first group from the pattern.
So grouped and summarized:
"^.*?\\b([0-9]+\\.?[0-9]*)\\s*$"
      1 ^^^^^^^^^^^^^^^^^^
   2 ^^^
3 ^^^
                        4 ^^^^
1. The number itself: one or more digits, an optional decimal point, then zero or more digits. This is the saved group that \\1 recalls.
2. A word boundary, so that the number starts at the beginning of a "word"; it would be possible for "12.345" to be parsed as "2.345" without this.
3. Anything (or nothing) before the number. Here the ? follows another quantifier, which makes .* match as little as possible.
4. Zero or more blank-space characters between the number and the end of the string.
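The pieces of the pattern can be checked one at a time. This is only a sketch, assuming R's default regex engine; the names `pattern` and `x` and the strings "count 42" and "0.5   " are made up for illustration, while the BRCA1 string comes from the question.

```r
# `pattern` is the full expression from the answer; `x` is one sample string
pattern <- "^.*?\\b([0-9]+\\.?[0-9]*)\\s*$"
x <- " an BRCA1 carrier 0.0001061067 "

# Whole pattern: the entire string is replaced by group 1, the number
full <- sub(pattern, "\\1", x)

# Number part alone, unanchored: the first digit run is the "1" in "BRCA1"
no_boundary <- regmatches(x, regexpr("[0-9]+\\.?[0-9]*", x))

# With \\b the match has to start at a word boundary, which skips "BRCA1"
with_boundary <- regmatches(x, regexpr("\\b[0-9]+\\.?[0-9]*", x))

# The decimal point and the trailing blank space are both optional
int_only <- sub(pattern, "\\1", "count 42")
padded   <- sub(pattern, "\\1", "0.5   ")

print(c(full, no_boundary, with_boundary, int_only, padded))
```

Comparing `no_boundary` with `with_boundary` shows what the word boundary buys when a digit is embedded in a word such as "BRCA1".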
Regex isn't unique to R; it's a pattern-matching language that R (and most other programming languages) supports in one way or another.
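For a quick feel of the individual tokens from the walk-through, here are a few base R one-liners. It is a rough sketch only: the variable names and the test strings other than the BRCA1 one are invented for illustration.

```r
x <- " an BRCA1 carrier 0.0001061067 "

# Anchors mark positions and consume nothing
starts <- grepl("^ an", x)       # does the string begin with " an"?
ends   <- grepl("carrier$", x)   # does it end with "carrier"?

# ? makes the preceding class optional
optional <- grepl("BRCA[12]? carrier", c("BRCA carrier", "BRCA1 carrier"))

# \\s is blank space and + means "one or more": strip trailing blanks
trimmed <- sub("\\s+$", "", x)

# A class spanning several ranges: uppercase hexadecimal digits
hex <- grepl("^[0-9A-F]+$", c("1A3F", "XYZ"))

# A group with | as "or", recalled as \\1 in the replacement
gene <- sub(".*(BRCA1|BRCA2).*", "\\1", x)

print(list(starts, ends, optional, trimmed, hex, gene))
```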