Regular Expression in R with a negative lookbehind


So I have the following data, let's say called "my_data":

Storm.Type
TYPHOON
SEVERE STORM
TROPICAL STORM
SNOWSTORM AND HIGH WINDS

What I want is to flag whether or not each element in my_data$Storm.Type is a storm, BUT I don't want to count tropical storms as storms (I'm going to classify them separately), so that I would end up with:

Storm.Type                    Is.Storm
TYPHOON                       0
SEVERE STORM                  1
TROPICAL STORM                0
SNOWSTORM AND HIGH WINDS      1

I have written the following code:

my_data$Is.Storm  <-  my_data[grep("(?<!TROPICAL) (?i)STORM", my_data$Storm.Type, perl = TRUE), "Storm.Type"]

But this only picks up "SEVERE STORM" as a storm and leaves out "SNOWSTORM AND HIGH WINDS". Thank you!


Solution

  • The problem is that your pattern requires a literal space before "STORM": the lookbehind (?<!TROPICAL) is followed by " STORM", so "SNOWSTORM", which has no space before "STORM", can never match.

    As a fix, consider moving the space into your negative lookbehind assertion, like so:

    ss <- c("TYPHOON","SEVERE STORM","TROPICAL STORM","SNOWSTORM AND HIGH WINDS",
            "THUNDERSTORM")
    grep("(?<!TROPICAL )(?i)STORM", ss, perl = TRUE)
    # [1] 2 4 5
    grepl("(?<!TROPICAL )(?i)STORM", ss, perl = TRUE)
    # [1] FALSE  TRUE FALSE  TRUE  TRUE
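
    To see what moving the space changes, the original pattern and the corrected one can be run side by side. This is only a minimal sketch: it rebuilds the ss vector from above, and the names original and fixed are made up for illustration.

    ```r
    ss <- c("TYPHOON", "SEVERE STORM", "TROPICAL STORM",
            "SNOWSTORM AND HIGH WINDS", "THUNDERSTORM")

    # Original pattern: the space sits outside the lookbehind, so a literal
    # " STORM" is required and the lookbehind only inspects what precedes the space.
    original <- grepl("(?<!TROPICAL) (?i)STORM", ss, perl = TRUE)

    # Corrected pattern: the space is part of the lookbehind, so "STORM" may
    # appear anywhere as long as it is not directly preceded by "TROPICAL ".
    fixed <- grepl("(?<!TROPICAL )(?i)STORM", ss, perl = TRUE)

    print(original)
    # [1] FALSE  TRUE FALSE FALSE FALSE
    print(fixed)
    # [1] FALSE  TRUE FALSE  TRUE  TRUE
    ```

    Only the corrected pattern picks up "SNOWSTORM AND HIGH WINDS" and "THUNDERSTORM", since neither contains " STORM" with a leading space.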
    

    I didn't know that (?i) and (?-i) switch case-insensitive matching on and off inside a regex. Cool find. Another way to do it is the ignore.case argument:

    grepl("(?<!tropical )storm", ss, perl = TRUE, ignore.case = TRUE)
    # [1] FALSE  TRUE FALSE  TRUE  TRUE
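
    One difference between the two approaches may matter if the data is not all upper case. In PCRE, an inline (?i) only affects the part of the pattern that follows it, so in "(?<!TROPICAL )(?i)STORM" the lookbehind stays case-sensitive, whereas ignore.case = TRUE covers the whole pattern. The sketch below assumes a made-up vector called mixed with mixed-case values, which is not part of the original data.

    ```r
    # Hypothetical mixed-case input, for illustration only
    mixed <- c("Tropical Storm", "tropical storm", "Severe Storm")

    # (?i) placed after the lookbehind: "TROPICAL " is still matched
    # case-sensitively, so lower- or mixed-case tropical storms slip through.
    late_flag <- grepl("(?<!TROPICAL )(?i)STORM", mixed, perl = TRUE)

    # (?i) placed at the start: the lookbehind is case-insensitive too.
    early_flag <- grepl("(?i)(?<!TROPICAL )STORM", mixed, perl = TRUE)

    # ignore.case = TRUE applies to the entire pattern.
    arg_flag <- grepl("(?<!tropical )storm", mixed, perl = TRUE, ignore.case = TRUE)

    print(late_flag)
    # [1] TRUE TRUE TRUE
    print(early_flag)
    # [1] FALSE FALSE  TRUE
    print(arg_flag)
    # [1] FALSE FALSE  TRUE
    ```

    For the all-caps data in the question both forms give the same result, so this only becomes relevant if the casing varies.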
    

    Then define your column:

    my_data$Is.Storm  <-  grepl("(?<!tropical )storm", my_data$Storm.Type,
                                perl = TRUE, ignore.case = TRUE)
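
    Note that grepl returns TRUE/FALSE, while the table in the question uses 0/1. If the numeric form is wanted, wrapping the call in as.integer should be enough. The sketch below rebuilds my_data from the four values shown in the question, so that it runs on its own.

    ```r
    # Rebuild the example data from the question
    my_data <- data.frame(
      Storm.Type = c("TYPHOON", "SEVERE STORM", "TROPICAL STORM",
                     "SNOWSTORM AND HIGH WINDS"),
      stringsAsFactors = FALSE
    )

    # grepl gives one logical per row; as.integer turns TRUE/FALSE into 1/0
    my_data$Is.Storm <- as.integer(
      grepl("(?<!tropical )storm", my_data$Storm.Type,
            perl = TRUE, ignore.case = TRUE)
    )

    print(my_data$Is.Storm)
    # [1] 0 1 0 1
    ```

    Unlike subsetting with grep, which returns only the matching rows, this keeps one value per row, so the result lines up with the existing column.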