I have a string variable containing patients' addresses. My goal is to flag patients who live at "401 30th street". I would like to flag strings that contain the number "401" before "30", to avoid flagging addresses like number 3. My code below only flags whether the string contains the numbers 401 and 30, regardless of their positions. Any help would be appreciated.
   ADDRESS                     Outcome
1  401 300th st                FALSE
2  40120 30 street             FALSE
3  30 401 plz                  TRUE
4  401 30th st                 TRUE
5  401 e gibbsborro rd, 305    FALSE
6  401 e 30th street, shelter  TRUE
7  401 east 30st               TRUE
8  401 e30th street, 3         TRUE
9  77-02 30th ave, 3rd fl      FALSE
10 401 e30 st.                 TRUE
structure(list(ADDRESS = c("401 300th st", "40120 30 street",
"30 401 plz", "401 30th st", "401 e gibbsborro rd, 305", "401 e 30th street, shelter",
"401 east 30st", "401 e30th street, 3", "77-02 30th ave, 3rd fl",
"401 e30 st."), Outcome = c(FALSE, FALSE, TRUE, TRUE, FALSE,
TRUE, TRUE, TRUE, FALSE, TRUE)), class = "data.frame", row.names = c(NA,
-10L))
library(dplyr)

location <- location %>%
  mutate(ADDRESS = tolower(ADDRESS),
         # flags rows containing 401 and some form of 30, in any order
         st30 = grepl("\\<401\\>", ADDRESS) &
           grepl("\\<30\\>|\\<30th\\>|\\<30st\\>|\\<e30th\\>|\\<e30\\>", ADDRESS))
Edit: I added new observations to the sample data as well as the variable I am looking to generate. The idea is to flag patients from 401 30th Street. To do this I would like to flag strings that have the number 401 before 30|30th|s30|east30|e30st etc. I hope this clarifies what I am looking for. Thanks.
To address the updated question, you need to use
grepl("^(?=.*\\b401\\b)(?=.*?\\be?30(?:th|st)?\\b)", ADDRESS, perl=TRUE)
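As a quick check, the pattern can be run against the sample data from the question. This is only a sketch in base R: the data frame name df and the column name st30 are taken from the snippets in this thread, and the comparison simply tests whether the result reproduces the Outcome column.

```r
# Sample data copied from the question's dput() output.
df <- data.frame(
  ADDRESS = c("401 300th st", "40120 30 street", "30 401 plz", "401 30th st",
              "401 e gibbsborro rd, 305", "401 e 30th street, shelter",
              "401 east 30st", "401 e30th street, 3", "77-02 30th ave, 3rd fl",
              "401 e30 st."),
  Outcome = c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE)
)

# Two lookaheads anchored at the start: both parts must occur, in any order.
pattern <- "^(?=.*\\b401\\b)(?=.*?\\be?30(?:th|st)?\\b)"
df$st30 <- grepl(pattern, df$ADDRESS, perl = TRUE)

# TRUE when the pattern reproduces the Outcome column from the question.
matches_outcome <- identical(df$st30, df$Outcome)
print(matches_outcome)
```

Note that row 3 ("30 401 plz") is flagged here because the two lookaheads do not enforce any order between 401 and 30.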
See the regex demo and the R demo. Details of the pattern:
^ - start of string
(?=.*\b401\b) - there must be 401 as a whole word somewhere after any zero or more chars other than line break chars, as many as possible
(?=.*?\be?30(?:th|st)?\b) - there must be a word boundary, an optional e, 30, then an optional th or st char sequence and a word boundary somewhere after any zero or more chars other than line break chars, as few as possible

Matching two substrings in order means placing a pattern that matches any text between them, such as ".*", ".*?", "[\s\S]*?" or "(?s:.)*?" (the latter two are PCRE/ICU compliant), etc. So, here, as there are no line breaks in the input, you could probably use
df %>%
mutate(st30 = grepl('401.*?30', ADDRESS))
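To see what this simple pattern does and does not catch, here is a minimal base R sketch on the question's sample addresses. It calls grepl directly instead of going through mutate, and the variable name loose is only illustrative.

```r
# Addresses copied from the question's sample data.
ADDRESS <- c("401 300th st", "40120 30 street", "30 401 plz", "401 30th st",
             "401 e gibbsborro rd, 305", "401 e 30th street, shelter",
             "401 east 30st", "401 e30th street, 3", "77-02 30th ave, 3rd fl",
             "401 e30 st.")

# "401", then any text, then "30": order is enforced, digit context is not.
loose <- grepl("401.*?30", ADDRESS)
print(data.frame(ADDRESS, loose))
# loose: TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
```

Row 3 ("30 401 plz") is no longer flagged because 30 comes first, but rows 1, 2 and 5 are, since 300, 40120 and 305 contain the digit sequences.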
However, the 401 and 30 patterns above are matching in any context. If you want to match them as exact integer values, you need to use numeric boundaries:
grepl('(?<!\\d)401(?!\\d).*?(?<!\\d)30(?!\\d)', ADDRESS, perl=TRUE)
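A small sketch of how the numeric boundaries behave on a few of the question's addresses (the vector name tests and the selection of strings are just for illustration):

```r
# Lookarounds reject a digit immediately before or after each number.
pattern <- "(?<!\\d)401(?!\\d).*?(?<!\\d)30(?!\\d)"

tests <- c("401 30th st",               # 401 then 30th
           "401 e30 st.",               # 30 glued to a letter is still fine
           "401 300th st",              # 30 is followed by a digit
           "40120 30 street",           # 401 is followed by a digit
           "401 e gibbsborro rd, 305",  # 30 is followed by a digit
           "30 401 plz")                # 30 comes before 401

res <- grepl(pattern, tests, perl = TRUE)
print(setNames(res, tests))
# res: TRUE TRUE FALSE FALSE FALSE FALSE
```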
Probably, you can also get away with simple word boundaries at the start of these numeric patterns (i.e. before them, no letter, digit or underscore is allowed):
grepl('\\b401(?!\\d).*?\\b30(?!\\d)', ADDRESS, perl=TRUE)
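The two boundary styles are not fully interchangeable on this data. The sketch below, again in base R with illustrative variable names, runs both on the question's addresses and lists the rows where they disagree.

```r
# Addresses copied from the question's sample data.
ADDRESS <- c("401 300th st", "40120 30 street", "30 401 plz", "401 30th st",
             "401 e gibbsborro rd, 305", "401 e 30th street, shelter",
             "401 east 30st", "401 e30th street, 3", "77-02 30th ave, 3rd fl",
             "401 e30 st.")

# Numeric boundaries: no digit directly before or after each number.
numeric_b <- grepl("(?<!\\d)401(?!\\d).*?(?<!\\d)30(?!\\d)", ADDRESS, perl = TRUE)

# Word boundary in front: no letter, digit or underscore directly before.
word_b <- grepl("\\b401(?!\\d).*?\\b30(?!\\d)", ADDRESS, perl = TRUE)

# Rows where the two variants give different answers.
differ <- which(numeric_b != word_b)
print(ADDRESS[differ])
# "401 e30th street, 3" "401 e30 st."
```

Because \b does not match between e and 3, the word-boundary version skips the e30 and e30th forms, so it is only a fit if those should not be flagged.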