How can I use the sub() function in R to change a factor level that contains a plus (+) symbol?


I ran into a hiccup using the sub() and gsub() functions in R to rename/change a factor level, and I'm not sure why it is not working.

Scenario: I have some survey data with several factors whose levels truncate the high value. For example, a question about how many hours you worked last week stops at "89 + hrs". I want to change this level to just "89" so I can use it numerically for other activities. I already know several other ways to do this, so I'm not looking for alternative ways to change levels.

I was following the instructions to use the sub() and gsub() functions from this site: http://www.cookbook-r.com/Manipulating_data/Renaming_levels_of_a_factor/ The concept is clear and straightforward.

Here is the initial example data:

x <- factor(c("a", "b", "c", "d"))
x
[1] a b c d
Levels: a b c d

I can change level d to level 89:

x <- factor(c("a", "b", "c", "d"))
levels(x) <- sub("d", "89", levels(x))
x
[1] a b c 89
Levels: a b c 89

I'm fine when I introduce a space in the level:

x <- factor(c("a", "b", "c", "d"))
levels(x) <- sub("d", "89 hrs", levels(x))
x
[1] a b c 89 hrs
Levels: a b c 89 hrs

I'm ok when I introduce a + symbol into the new factor level:

x <- factor(c("a", "b", "c", "d"))
levels(x) <- sub("d", "89+ hrs", levels(x))
x
[1] a b c 89+ hrs
Levels: a b c 89+ hrs 

But I get stuck when I'm trying to rename/change the level that has the + symbol into one without it:

x <- factor(c("a", "b", "c", "89+ hrs"))
x
[1] a b c 89+ hrs
Levels: 89+ hrs a b c

levels(x) <- sub("89+ hrs", "d", levels(x))
x
[1] a b c 89+ hrs
Levels: 89+ hrs a b c

Same issue when I include the specific string example from the linked site:

levels(x) <- sub("^89+ hrs$", "d", levels(x))
x
[1] a b c 89+ hrs
Levels: 89+ hrs a b c

I get the same issue if I use gsub() instead of sub().

The issue also happens if I had a * instead of a +, but works if it is a dot (.) instead of a +. So I'm thinking it has something to do with certain special characters but not others.

Any thoughts why this isn't working with the + symbol and how I can use these functions? Thanks in advance!


Solution

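  • The sub() function uses regular expressions by default, and + is a special character in regular expressions: it means "one or more of the preceding character". The pattern "89+ hrs" therefore matches "89 hrs" or "899 hrs", but not the literal text "89+ hrs". If you want to match a literal plus sign, use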

    levels(x) <- sub("89\\+ hrs", "d", levels(x))
    

    or

    levels(x) <- sub("89+ hrs", "d", levels(x), fixed=TRUE)
    

    Nothing about this is really unique to factors. This is just how sub() works with any character vector, and levels() just happens to return a character vector.
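
To see why the original pattern never matched, it can help to test the patterns directly with grepl() before touching the factor. The following is a minimal sketch using base R only; the labels vector and the variable names are made up for illustration and are not part of the original question.

```r
# Hypothetical test strings: the literal level plus two look-alikes
labels <- c("89+ hrs", "89 hrs", "899 hrs")

# Unescaped: "9+" means "one or more 9s", so the literal "+" is never matched
unescaped <- grepl("89+ hrs", labels)
unescaped
# [1] FALSE  TRUE  TRUE

# Escaped: "\\+" matches a literal plus sign
escaped <- grepl("89\\+ hrs", labels)
escaped
# [1]  TRUE FALSE FALSE

# fixed = TRUE treats the whole pattern as plain text, no regex at all
fixed_match <- grepl("89+ hrs", labels, fixed = TRUE)
fixed_match
# [1]  TRUE FALSE FALSE

# Applying the fixed-string version to the factor from the question
x <- factor(c("a", "b", "c", "89+ hrs"))
levels(x) <- sub("89+ hrs", "89", levels(x), fixed = TRUE)
as.character(x)
# [1] "a"  "b"  "c"  "89"
```

This also accounts for the other observations in the question: * is a quantifier too ("zero or more"), so it fails for the same reason, while an unescaped . matches any single character, so a pattern containing a dot happens to match a literal dot as well.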