I have a data frame like this:
levels<- c("level 1", "LEVEL 1", "Level 1 ", "Level I", "Level I ",
"level one", "Level one", "Level One", "Level 1")
df<- as.data.frame(levels)
> df
levels
1 level 1
2 LEVEL 1
3 Level 1 #this one has a space at the end.
4 Level I
5 Level I #this one also has a space at the end.
6 level one
7 Level one
8 Level One
9 Level 1 #this is the correct format I want.
As you can see, some entries are in upper case, some have a trailing space, and the "1" is written variously as a digit, as a word, or as a Roman numeral.
I know I can just write multiple lines with gsub(), but I wanted to find a less tedious way to solve this problem.
This data frame has the same issue with level 2 and level 3 (for example "level 2", "level III ", "level II", "Level Two", "level three", "Level TWO"). Moreover, the data also includes strings that are not just "level #", such as "Level 1 with specifications", "Level 2 with specifications", "Level 3 with specifications", "Level 1 with others included", "Moderate", "Mild", "Severe", etc.
I do not want to replace those other strings. I only want to replace the oddly formatted levels with just "Level 1", "Level 2", "Level 3".
I tried this using apply() and for loops with gsub(). However, none of them seem to work. I think this may be because gsub() can't take a list?
I also wanted to use regular expressions to grab a pattern using str_replace(), but I can't figure out how to. I have never used str_replace() and am new to regular expressions.
Any ideas?
Here's a general approach allowing for levels to be in English words, Arabic or Roman numerals. The final output is always of the format "Level (Arabic numeral)".
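library(english)
# Build a regex matching level i as an Arabic numeral, a lower-case Roman numeral, or English words
givePattern <- function(i)
  paste0("( |^)(", paste(i, tolower(as.character(as.roman(i))), as.character(english(i)), sep = "|"), ")( |$)")
# For each level in lvls, replace every element that contains that level with "Level <lvl>"
fixLevels <- function(x, lvls)
  Reduce(function(y, lvl) replace(y, grep(givePattern(lvl), y), paste("Level", lvl)), lvls, init = tolower(x))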
levels <- c(" level vi ", "LEVEL Three ", " level thirteen",
"Level XXI", "level CXXIII", " level fifty")
fixLevels(levels, 1:150)
# [1] "Level 6" "Level 3" "Level 13" "Level 21" "Level 123" "Level 50"
The first argument of fixLevels is a character vector, while the second argument is a vector of all levels to check for in that vector.
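For the data frame in the question, the column itself can be passed as that first argument. Since the question asks that strings such as "Level 1 with specifications" or "Moderate" stay untouched, here is a minimal base-R sketch of a stricter variant. It is not the function above: normalizeExactLevels and its words lookup are hypothetical names introduced for illustration, and the sketch assumes only levels 1 to 3 occur.

```r
# Hypothetical helper (not part of the answer's code): rewrite only strings that
# are exactly "level <n>" in some spelling; everything else is returned unchanged.
normalizeExactLevels <- function(x, lvls = 1:3,
                                 words = c("one", "two", "three")) {
  out <- as.character(x)
  key <- tolower(trimws(out))  # compare case-insensitively, ignoring outer spaces
  for (i in seq_along(lvls)) {
    lvl <- lvls[i]
    # Arabic numeral, lower-case Roman numeral, English word
    forms <- c(lvl, tolower(as.character(utils::as.roman(lvl))), words[i])
    # ^ and $ force the whole string to match, so longer strings are skipped
    pattern <- paste0("^level\\s+(", paste(forms, collapse = "|"), ")$")
    out[grepl(pattern, key)] <- paste("Level", lvl)
  }
  out
}

df <- data.frame(levels = c("level 1", "LEVEL 1", "Level 1 ", "Level I ",
                            "level one", "Level TWO", "level III ",
                            "Level 1 with specifications", "Moderate"),
                 stringsAsFactors = FALSE)
df$levels <- normalizeExactLevels(df$levels)
df$levels
```

Matching against a lower-cased, trimmed copy (key) while writing into out is what keeps the case of unrelated strings like "Moderate" intact.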
The function uses givePattern to build a regular expression that detects integer level i in any format, e.g.,
givePattern(132)
# [1] "( |^)(132|cxxxii|one hundred thirty two)( |$)"
meaning that we look for 132, cxxxii, or one hundred thirty two, delimited by spaces and/or the beginning or end of the string. All matching is done in lower case.
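To see what these space and start/end delimiters accept, the pattern for level 1 can be tried on a few lower-case strings. The pattern below is written out by hand in the same shape, so the sketch needs no extra packages; the candidate strings are made up for illustration.

```r
# Pattern for level 1, written by hand in the shape givePattern() produces
pattern <- "( |^)(1|i|one)( |$)"

candidates <- c("level 1", "level i ", "level one", "level 11",
                "level iii", "level 1 with specifications")
matches <- grepl(pattern, candidates)
setNames(matches, candidates)
# "level 11" and "level iii" do not match: the token must be delimited on both sides.
# "level 1 with specifications" does match, because " 1 " is delimited by spaces.
```

The last case matters for the data in the question: any string that merely contains a level token is treated as that level, so strings that should be left alone need a stricter, whole-string pattern.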
Now fixLevels utilizes givePattern. The anonymous function
function(y, lvl) replace(y, grep(givePattern(lvl), y), paste("Level", lvl))
takes some vector y, finds its elements where some form of level lvl is present, and replaces those elements with "Level lvl". Call this function f(y, lvl). We pass to Reduce this function f, a vector of levels lvls, and an initial vector tolower(x). Suppose that lvls is 1:3. What happens then is the following: r1 := f(tolower(x), 1), r2 := f(r1, 2), r3 := f(r2, 3), and we are done: r3 is our final output, where each of the levels has been taken care of.
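The chain r1, r2, r3 can be made visible with Reduce's accumulate = TRUE argument, which returns every intermediate result. This is only a sketch: simplePattern is a hypothetical stand-in for givePattern, limited to levels 1 to 3 so that it runs without the english package, and the input vector is invented.

```r
# Hypothetical stand-in for givePattern(), covering levels 1-3 only
words <- c("one", "two", "three")
simplePattern <- function(i) {
  paste0("( |^)(",
         paste(i, tolower(as.character(utils::as.roman(i))), words[i], sep = "|"),
         ")( |$)")
}

# Same shape as the anonymous function passed to Reduce in fixLevels
f <- function(y, lvl) replace(y, grep(simplePattern(lvl), y), paste("Level", lvl))

x <- c("LEVEL one", "level II ", "Level 3")
# accumulate = TRUE keeps the initial value and r1, r2, r3 in a list
steps <- Reduce(f, 1:3, tolower(x), accumulate = TRUE)
steps
# steps[[1]] is tolower(x); steps[[2]], steps[[3]], steps[[4]] are r1, r2, r3
```

Each step rewrites only the elements matching its own level, so steps[[4]] is what fixLevels-style folding would return for this input.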