How to replace differently formatted strings in a column with one string in R?


I have a data frame like this:

levels<- c("level 1", "LEVEL 1", "Level 1 ", "Level I", "Level I ", 
"level one", "Level one", "Level One", "Level 1")
df<- as.data.frame(levels)
> df
 levels
1 level 1
2 LEVEL 1
3 Level 1 #this one has a space at the end. 
4 Level I
5 Level I #this one also has a space at the end. 
6 level one
7 Level one
8 Level One
9 Level 1 #this is the correct format I want. 

As you can see, some of them are in upper case, some have a space at the end, and the "1" is written as a digit, as a word, or even as a Roman numeral.

I know I can just write multiple lines with gsub(), but I wanted to find a less tedious way to solve this problem.

This data frame has the same issue with level 2 and level 3 (for example "level 2", "level III ", "level II", "Level Two", "level three", "Level TWO"). Moreover, the data also includes strings that are not just "level #", such as "Level 1 with specifications", "Level 2 with specifications", "Level 3 with specifications", "Level 1 with others included", "Moderate", "Mild", "Severe", etc.

I do not want to replace those other strings ("Level 1 with specifications", "Level 2 with specifications", "Level 3 with specifications", "Level 1 with others included", "Moderate", "Mild", "Severe", etc.); I only want to replace all of the oddly formatted levels with just "Level 1", "Level 2", "Level 3".

I tried this using apply() and for loops with gsub(), but neither seems to work. Maybe this is because gsub() can't take a list?

I also wanted to use regular expressions to grab a pattern using str_replace(), but I can't figure out how to. I have never used str_replace() and am new to regular expressions.

Any ideas?


Solution

  • Here's a general approach allowing for levels to be in English words, Arabic or Roman numerals. The final output is always of the format "Level (Arabic numeral)".

    library(english)
    givePattern <- function(i)
      paste0("( |^)(", paste(i, tolower(as.character(as.roman(i))), as.character(english(i)), sep = "|"), ")( |$)")
    fixLevels <- function(x, lvls)
      Reduce(function(y, lvl) replace(y, grep(givePattern(lvl), y), paste("Level", lvl)), lvls, init = tolower(x))
    
    levels <- c(" level vi  ", "LEVEL Three  ", "   level thirteen", 
                "Level XXI", "level CXXIII", "    level fifty")
    fixLevels(levels, 1:150)
    # [1] "Level 6"   "Level 3"   "Level 13"  "Level 21"  "Level 123" "Level 50"
    

    The first argument of fixLevels is a character vector, while the second argument is a vector of all levels to check for in that vector.

    The helper givePattern builds a regular expression that detects integer level i in any of these formats, e.g.,

    givePattern(132)
    # [1] "( |^)(132|cxxxii|one hundred thirty two)( |$)"
    

    meaning that we look for 132 or cxxxii or one hundred thirty two, preceded by a space or the beginning of the string and followed by a space or the end of the string. Everything is matched in lower case.
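
    To see what such a pattern accepts, here is a small base-R check. The pattern is written out by hand and should equal what givePattern(1) returns; pattern1, inputs and hits are names made up for this illustration.

    ```r
    # Pattern written out by hand; it should equal givePattern(1) from the answer.
    pattern1 <- "( |^)(1|i|one)( |$)"

    inputs <- c("LEVEL 1", "Level I ", "level one", "Level 11", "Level II",
                "Level 1 with specifications", "Moderate")

    # Lower-case first, as fixLevels does via init = tolower(x).
    hits <- grepl(pattern1, tolower(inputs))
    print(setNames(hits, inputs))
    # hits: TRUE TRUE TRUE FALSE FALSE TRUE FALSE
    ```

    Note that "Level 1 with specifications" also matches, because the pattern does not require the level to be the last thing in the string. If such strings must stay untouched, as the question asks, the pattern has to be stricter.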

    Now fixLevels utilizes givePattern. The anonymous function

    function(y, lvl) replace(y, grep(givePattern(lvl), y), paste("Level", lvl))
    

    takes some vector y, finds the elements where some form of level lvl is present, and replaces those elements with "Level lvl". Call this function f(y, lvl). We pass to Reduce this function f, a vector of levels lvls, and an initial vector tolower(x). Suppose that lvls is 1:3. Then the following happens: r1 := f(tolower(x), 1), r2 := f(r1, 2), r3 := f(r2, 3), and we are done: r3 is our final output, in which each of the levels has been taken care of.
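
    As written, fixLevels lower-cases every element (because of init = tolower(x)) and matches a level anywhere in the string, so inputs such as "Moderate" or "Level 1 with specifications" would come back as "moderate" and "Level 1". The following is only a sketch of one possible stricter variant, not part of the original answer. It assumes that only levels 1 to 3 occur and that a string should be rewritten only when it consists of nothing but "level" plus one spelling of the number. The names words, strictPattern and fixLevelsStrict are hypothetical, and the number words are hard-coded so that no package is needed.

    ```r
    # Hypothetical lookup of number words for levels 1 to 3 (hard-coded assumption).
    words <- c("one", "two", "three")

    # Pattern that must cover the entire string: optional surrounding spaces,
    # the word "level", then exactly one spelling of level i.
    strictPattern <- function(i)
      paste0("^ *level +(", i, "|", tolower(as.character(as.roman(i))), "|",
             words[i], ") *$")

    # Same Reduce fold as fixLevels, but matching is done on a lower-cased copy
    # while replacements are written into the original vector, so elements
    # that never match keep their original spelling.
    fixLevelsStrict <- function(x, lvls) {
      lower <- tolower(x)
      step <- function(y, lvl)
        replace(y, grep(strictPattern(lvl), lower), paste("Level", lvl))
      Reduce(step, lvls, init = x)
    }

    x <- c("level 1", "LEVEL 1", "Level 1 ", "Level I ", "level one", "Level TWO",
           "level III ", "Level 1 with specifications", "Moderate")
    result <- fixLevelsStrict(x, 1:3)
    print(result)
    # [1] "Level 1" "Level 1" "Level 1" "Level 1" "Level 1" "Level 2" "Level 3"
    # [8] "Level 1 with specifications" "Moderate"
    ```

    For a data frame column, the call would be along the lines of df$levels <- fixLevelsStrict(as.character(df$levels), 1:3); the as.character() call is there in case the column is a factor.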