I ran into a hiccup using the sub() and gsub() functions in R to rename a factor level, and I'm not sure why it isn't working.
Scenario: I have some survey data with several factors whose levels truncate the high value. For example, a question about how many hours you worked last week stops at "89 + hrs". I want to change this level to just "89" so I can use it numerically for other activities. I already know several other ways to do this, so I'm not looking for alternative ways to change levels.
I was following the instructions to use the sub() and gsub() functions from this site: http://www.cookbook-r.com/Manipulating_data/Renaming_levels_of_a_factor/ The concept is clear and straightforward.
Here is the initial example data:
x <- factor(c("a", "b", "c", "d"))
x
[1] a b c d
Levels: a b c d
I can change level d to level 89:
x <- factor(c("a", "b", "c", "d"))
levels(x) <- sub("d", "89", levels(x))
x
[1] a b c 89
Levels: a b c 89
I'm fine when I introduce a space in the level:
x <- factor(c("a", "b", "c", "d"))
levels(x) <- sub("d", "89 hrs", levels(x))
x
[1] a b c 89 hrs
Levels: a b c 89 hrs
I'm ok when I introduce a + symbol into the new factor level:
x <- factor(c("a", "b", "c", "d"))
levels(x) <- sub("d", "89+ hrs", levels(x))
x
[1] a b c 89+ hrs
Levels: a b c 89+ hrs
But I get stuck when I'm trying to rename/change the level that has the + symbol into one without it:
x <- factor(c("a", "b", "c", "89+ hrs"))
x
[1] a b c 89+ hrs
Levels: 89+ hrs a b c
levels(x) <- sub("89+ hrs", "d", levels(x))
x
[1] a b c 89+ hrs
Levels: 89+ hrs a b c
Same issue when I include the specific string example from the linked site:
levels(x) <- sub("^89+ hrs$", "d", levels(x))
x
[1] a b c 89+ hrs
Levels: 89+ hrs a b c
I get the same issue if I use gsub() instead of sub() as well.
The issue also happens if I had a * instead of a +, but works if it is a dot (.) instead of a +. So I'm thinking it has something to do with certain special characters but not others.
Any thoughts why this isn't working with the + symbol and how I can use these functions? Thanks in advance!
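The sub() function uses regular expressions by default, and + is a special character in regular expressions: it is a quantifier meaning "one or more of the preceding item". The pattern "89+ hrs" therefore looks for an 8 followed by one or more 9s and then " hrs", and never matches a literal plus sign. If you want to match a literal plus sign, use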
levels(x) <- sub("89\\+ hrs", "d", levels(x))
or
levels(x) <- sub("89+ hrs", "d", levels(x), fixed=TRUE)
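As a quick check, here is a small sketch that runs both variants next to the unescaped pattern on the example factor from the question. The variable names (escaped, literal, unescaped) are only for illustration.

```r
x <- factor(c("a", "b", "c", "89+ hrs"))

# Unescaped: "+" is treated as a quantifier, so nothing matches
unescaped <- sub("89+ hrs", "d", levels(x))

# Option 1: escape the plus sign so the regex engine sees a literal "+"
escaped <- sub("89\\+ hrs", "d", levels(x))

# Option 2: turn off regular expressions and match the string as-is
literal <- sub("89+ hrs", "d", levels(x), fixed = TRUE)

unescaped  # levels are unchanged
escaped    # "89+ hrs" has been replaced by "d"
literal    # same result as the escaped version

# Assign either working result back to rename the level
levels(x) <- escaped
x
```

Note that fixed = TRUE also disables anchors such as ^ and $, so use the escaped form if you need to match the whole level exactly.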
Nothing about this is really unique to factors. This is just how sub() works with any character vector, and levels() just happens to return a character vector.
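To illustrate that no factor is needed to see this behavior, here is a rough sketch on plain character strings. It also covers the other two characters mentioned in the question: * is a quantifier too ("zero or more"), while . is a wildcard that matches any single character, which is why the dot case only appears to work. The variable names are made up for the example.

```r
# "+" means "one or more 9s", so the literal plus sign is never matched
plus_regex <- sub("89+ hrs", "d", "89+ hrs")
plus_other <- sub("89+ hrs", "d", "8999 hrs")  # but this string does match

# "*" means "zero or more 9s", so the literal asterisk is never matched
star_regex <- sub("89* hrs", "d", "89* hrs")

# "." matches any single character, including a literal dot or a plus sign
dot_regex <- sub("89. hrs", "d", "89. hrs")
dot_other <- sub("89. hrs", "d", "89+ hrs")

# With fixed = TRUE every character is taken literally
plus_fixed <- sub("89+ hrs", "d", "89+ hrs", fixed = TRUE)
star_fixed <- sub("89* hrs", "d", "89* hrs", fixed = TRUE)

c(plus_regex, plus_other, star_regex, dot_regex, dot_other, plus_fixed, star_fixed)
```

So the dot pattern succeeds by accident rather than because the dot is treated literally; escaping it as "89\\. hrs" or using fixed = TRUE would be the safer choice there as well.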