
Use recode to clean data frame column


How do I use recode() in order to "clean/strip" certain parts of a column in my data frame? The original data frame looks like this:

df <- data.frame(duration = c("concentration, up to 2 minutes", "concentration, up to 4 minutes", "up to 6 hours"), name = c("Earth", "Water", "Fire"))

The desired version looks like this:

df <- data.frame(duration = c("2 minutes", "4 minutes", "6 hours"), name = c("Earth", "Water", "Fire"))

So I need to delete "concentration," and "up to", or replace them with an empty string, using the recode function.


Solution

  • Here are two solutions, one with dplyr::recode() and one with stringr::str_remove().

    I would advise learning the latter as well: it takes a regular expression, which is a far more powerful and general way to transform strings than listing every value by hand.

    Solution with dplyr::recode()

    library(dplyr)
    #> 
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #> 
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #> 
    #>     intersect, setdiff, setequal, union
    df <- data.frame(duration = c("concentration, up to 2 minutes", 
                                  "concentration, up to 4 minutes", 
                                  "up to 6 hours"), 
                     name = c("Earth", "Water", "Fire"))
    
    df$duration <- recode(df$duration,
                          "concentration, up to 2 minutes" = "2 minutes",
                          "concentration, up to 4 minutes" = "4 minutes",
                          "up to 6 hours" = "6 hours")
    df
    #>    duration  name
    #> 1 2 minutes Earth
    #> 2 4 minutes Water
    #> 3   6 hours  Fire
    

    Created on 2020-05-04 by the reprex package (v0.3.0)
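
    Note that recode() replaces whole values only; it cannot strip part of a string. The sketch below illustrates this with a lookup vector spliced in via `!!!`. The names `lookup`, `x`, and `recoded`, and the value "up to 8 hours", are invented for illustration and are not part of the question. In recent dplyr releases recode() is marked as superseded, so treat this as a sketch rather than the recommended approach.

    ```r
    library(dplyr)

    # Hypothetical lookup table: names are the old values, elements are the new ones.
    lookup <- c(
      "concentration, up to 2 minutes" = "2 minutes",
      "concentration, up to 4 minutes" = "4 minutes",
      "up to 6 hours"                  = "6 hours"
    )

    # "up to 8 hours" is an invented value that is NOT in the lookup table.
    x <- c("concentration, up to 2 minutes", "up to 6 hours", "up to 8 hours")

    # Splice the named vector into recode(). Values without an exact match
    # are returned unchanged, so "up to 8 hours" keeps its "up to " prefix.
    recoded <- recode(x, !!!lookup)
    recoded
    ```

    Every new duration therefore needs its own entry in the lookup table, which is why a pattern-based approach scales better.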

    Solution with stringr::str_remove()

    library(stringr)
    df <- data.frame(duration = c("concentration, up to 2 minutes", 
                                  "concentration, up to 4 minutes", 
                                  "up to 6 hours"), 
                     name = c("Earth", "Water", "Fire"))
    
    
    # Lazy .*? stops before the first digit, so multi-digit values such as "10 minutes" stay intact
    df$duration <- str_remove(df$duration, "^.*?(?=\\d)")
    df
    #>    duration  name
    #> 1 2 minutes Earth
    #> 2 4 minutes Water
    #> 3   6 hours  Fire
    

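
    A few pattern-based variants, as a sketch only. The question asks to delete the literal phrases "concentration," and "up to", which can be done directly with str_remove_all(). When stripping everything before the first digit instead, a greedy `.*` would consume all but the last digit of a number such as 10, so the sketch uses a lazy `.*?`. The vector `duration` and the value "up to 10 minutes" are invented for illustration.

    ```r
    library(stringr)

    # "up to 10 minutes" is an invented value added to exercise a two-digit number.
    duration <- c("concentration, up to 2 minutes",
                  "up to 6 hours",
                  "up to 10 minutes")

    # Variant 1: remove every occurrence of either literal phrase.
    by_phrase <- str_remove_all(duration, "concentration, |up to ")

    # Variant 2: lazily drop everything before the first digit.
    by_digit <- str_remove(duration, "^.*?(?=\\d)")

    # Variant 3: base R, no extra package; strip the leading run of non-digits.
    by_base <- sub("^\\D*", "", duration)

    by_phrase
    ```

    All three variants give the same result on this data. They differ on strings that contain no digit: variant 3 would empty such a string, while variants 1 and 2 would leave it alone unless it contains one of the phrases.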