r  string  grepl

Find if a string appears before another string


I have a string variable containing patients' addresses. My goal is to flag patients who live at "401 30th street". I would like to flag strings that contain the number "401" before "30", to avoid flagging addresses like number 3. My code below only flags whether the string contains the numbers 401 and 30, regardless of their positions. Any help would be appreciated.

                      ADDRESS Outcome
1                401 300th st   FALSE
2             40120 30 street   FALSE
3                  30 401 plz    TRUE
4                 401 30th st    TRUE
5    401 e gibbsborro rd, 305   FALSE
6  401 e 30th street, shelter    TRUE
7               401 east 30st    TRUE
8         401 e30th street, 3    TRUE
9      77-02 30th ave, 3rd fl   FALSE
10                401 e30 st.    TRUE
structure(list(ADDRESS = c("401 300th st", "40120 30 street", 
"30 401 plz", "401 30th st", "401 e gibbsborro rd, 305", "401 e 30th street, shelter", 
"401 east 30st", "401 e30th street, 3", "77-02 30th ave, 3rd fl", 
"401 e30 st."), Outcome = c(FALSE, FALSE, TRUE, TRUE, FALSE, 
TRUE, TRUE, TRUE, FALSE, TRUE)), class = "data.frame", row.names = c(NA, 
-10L))
location <- location %>%
  mutate(ADDRESS = tolower(ADDRESS),
         st30 =  grepl("\\<401\\>", ADDRESS) & 
          grepl("\\<30\\>|\\<30th\\>|\\<30st\\>|\\<e30th\\>|\\<e30\\>", ADDRESS))

Edit: I added new observations to the sample data as well as the variable I am looking to generate. The idea is to flag patients from 401 30th Street. To do this I would like to flag strings that have the number 401 before 30|30th|s30|east30|e30st etc. I hope this clarifies what I am looking for. Thanks.


Solution

  • To address the updated question, you need to use

    grepl("^(?=.*\\b401\\b)(?=.*?\\be?30(?:th|st)?\\b)", ADDRESS, perl=TRUE)
    

    Details:

    • ^ - start of string
    • (?=.*\b401\b) - a positive lookahead: 401 must appear as a whole word somewhere after any zero or more chars other than line break chars, as many as possible
    • (?=.*?\be?30(?:th|st)?\b) - a positive lookahead: somewhere after any zero or more chars other than line break chars, as few as possible, there must be a word boundary, an optional e, 30, then an optional th or st char sequence, and a word boundary
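
As a quick check, the lookahead pattern can be run against the sample data from the question. This is a minimal sketch in base R; the data frame is rebuilt by hand from the `dput` output, and the column name `st30` follows the question. Because both lookaheads start from `^`, each one scans the whole string independently, so the pattern tests for the presence of both tokens rather than their order. That matches the sample `Outcome` column, where "30 401 plz" is expected to be TRUE.

```r
# Sample addresses and expected flags copied from the question.
location <- data.frame(
  ADDRESS = c("401 300th st", "40120 30 street", "30 401 plz", "401 30th st",
              "401 e gibbsborro rd, 305", "401 e 30th street, shelter",
              "401 east 30st", "401 e30th street, 3", "77-02 30th ave, 3rd fl",
              "401 e30 st."),
  Outcome = c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE)
)

# Two lookaheads anchored at the start of the string; perl = TRUE is required
# because lookarounds are not supported by R's default regex engine.
pattern <- "^(?=.*\\b401\\b)(?=.*?\\be?30(?:th|st)?\\b)"
location$st30 <- grepl(pattern, tolower(location$ADDRESS), perl = TRUE)

print(location)

# Compare the computed flag with the expected Outcome on every row.
all_match <- all(location$st30 == location$Outcome)
print(all_match)
```

On these ten rows the computed flag should agree with `Outcome` everywhere; other address spellings (for example "s30" from the edit) would need the pattern extended.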

    When you use two separate `grepl` calls, the matches are searched for irrespective of the order of their appearance in the string.

    Matching two substrings in order means

    • Matching the leftmost pattern
    • Matching any chars (because the regex engine must somehow get to the second pattern) with a pattern like .*, .*?, [\s\S]*?, (?s:.)*? (the latter two are PCRE/ICU compliant), etc.
    • Matching the rightmost pattern.
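
The three steps above can be sketched as a single combined pattern and contrasted with two independent `grepl` calls. The helper name `ordered_pattern` and its `filler` argument are hypothetical, introduced only for illustration; the two addresses are taken from the sample data.

```r
# Hypothetical helper: leftmost pattern, then a filler that consumes the
# characters in between, then the rightmost pattern.
ordered_pattern <- function(left, right, filler = ".*?") {
  paste0(left, filler, right)
}

addresses <- c("401 30th st", "30 401 plz")

# Two independent calls: each only checks presence, so order is ignored.
unordered <- grepl("401", addresses) & grepl("30", addresses)

# One combined pattern: 401 has to occur before 30.
ordered <- grepl(ordered_pattern("401", "30"), addresses, perl = TRUE)

print(data.frame(addresses, unordered, ordered))
```

Both addresses pass the unordered check, while only "401 30th st" passes the ordered one.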

    So, here, as there are no line breaks in the input, you could probably use

    df %>%
        mutate(st30 = grepl('401.*?30', ADDRESS))
    

    However, the 401 and 30 patterns above match in any context, for example inside 40120 or 305. If you want to match them as exact integer values, you need to use numeric boundaries:

    grepl('(?<!\\d)401(?!\\d).*?(?<!\\d)30(?!\\d)', ADDRESS, perl=TRUE)
    

    Probably, you can also get away with simple word boundaries at the start of these numeric patterns (i.e. no letter, digit or underscore is allowed immediately before them):

    grepl('\\b401(?!\\d).*?\\b30(?!\\d)', ADDRESS, perl=TRUE)
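
To see how the three ordered variants differ, the sketch below runs each of them over the sample addresses from the question. The labels `plain`, `numeric` and `word` are made up for this comparison. All three patterns are run with `perl = TRUE` so that they share one engine.

```r
addresses <- c("401 300th st", "40120 30 street", "30 401 plz", "401 30th st",
               "401 e gibbsborro rd, 305", "401 e 30th street, shelter",
               "401 east 30st", "401 e30th street, 3", "77-02 30th ave, 3rd fl",
               "401 e30 st.")

# The three ordered patterns discussed above, under illustrative labels.
patterns <- c(
  plain   = "401.*?30",
  numeric = "(?<!\\d)401(?!\\d).*?(?<!\\d)30(?!\\d)",
  word    = "\\b401(?!\\d).*?\\b30(?!\\d)"
)

# One logical column per pattern, one row per address.
flags <- sapply(patterns, function(p) grepl(p, addresses, perl = TRUE))
rownames(flags) <- addresses
print(flags)
```

On this data the plain pattern also flags "401 300th st", "40120 30 street" and the "305" address. The numeric-boundary pattern flags only the rows where 401 and 30 stand as separate numbers in that order. The word-boundary variant additionally misses "e30" and "e30th", because there is no word boundary between the letter e and the digit 3. None of the three flags "30 401 plz", so if that row should be TRUE, as in the sample `Outcome` column, the lookahead version at the top of the answer is the one to use.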