Tags: r, string, dplyr, string-matching

Is there a way to recode a vector of strings into a new vector based on which of two keywords or phrases appears in each value?


As the title indicates, I would like to convert a vector of strings into a new vector containing whichever of two values appears in each string. Here is a very simple example data frame:

data <- tibble::tibble(
  w = c("Strongly disagree", "Somewhat disagree", "Disagree", "Somewhat agree", "Strongly agree", "Agree"),
  x = c("Definitely true", "Probably true", "Somewhat false", "Definitely false", "Definitely true", "Definitely false"),
  y = c("Definitely not doing enough", "Definitely doing enough", "Possibly not doing enough", "Possibly doing enough", "Definitely not doing enough", "Somewhat doing enough"),
  z = c("Very comfortable", "Comfortable", "Somewhat comfortable", "Very uncomfortable", "Somewhat uncomfortable", "Comfortable")
)

We can see that every string in w contains either "agree" or "disagree", x contains either "true" or "false", y contains "doing enough" or "not doing enough", and z contains either "comfortable" or "uncomfortable". Is there a function that would let me create a new vector based on which of the two values is present in each string of a column? Let me illustrate what I mean.

# write up a function
some_function <- function(arguments) {
  "function text goes here"
}

# use new function to create a vector based on `w` from `data`
data %>% some_function(w)

# resulting vector would be:
[1] "Disagree" "Disagree" "Disagree" "Agree" "Agree" "Agree"

The closest I have gotten is with the function below. However, it works by removing the first word of the string. That would be fine if the first word of every string were an adjective describing the rest of the string, but when a string is just one word it gives me an NA.

# write function
make_dicho <- function(df = data, var) {
  
  df %>% 
    # pick out the column (equivalent to df[[var]])
    dplyr::pull({{ var }}) %>% 
    # convert to a factor
    haven::as_factor() %>% 
    # remove the first part of the factor
    stringr::str_extract("(?<=\\s).+") %>%
    # make the first letter uppercase
    stringr::str_to_sentence()
  
}
# test this on the fake data
data %>% make_dicho(., w)
[1] "Disagree" "Disagree" NA         "Agree"    "Agree"    NA  

The reason I have the df argument is that I would like to use this function inside dplyr::mutate(), like this: data %>% mutate(new_a = make_dicho(., w)).


Solution

  • From your description, it sounds like you're happy with removing the first word, except when the value is only a single word. We can assume a value is a single word if it contains no whitespace.

    remove_first_word  <- function(x) {
        ifelse(
            grepl("\\s", x),
            # drop the first word and the whitespace that follows it
            sub("^\\S+\\s+", "", x),
            x
        )  |>
        # Make first letter upper case
        gsub("^([a-z])", "\\U\\1", x = _, perl = TRUE)
    }
    

    Then you can use it in mutate() as desired:

    library(dplyr)

    data  |>
        mutate(
            across(w:z, remove_first_word)
        )
    # # A tibble: 6 × 4
    #   w        x     y                z            
    #   <chr>    <chr> <chr>            <chr>        
    # 1 Disagree True  Not doing enough Comfortable  
    # 2 Disagree True  Doing enough     Comfortable  
    # 3 Disagree False Not doing enough Comfortable  
    # 4 Agree    False Doing enough     Uncomfortable
    # 5 Agree    True  Not doing enough Uncomfortable
    # 6 Agree    False Doing enough     Comfortable  
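
The call above overwrites w to z. Since the question mentions creating a new column (new_a), here is a minimal sketch of keeping the originals and adding recoded copies via the .names argument of across(). The small data frame, the "_dicho" suffix and strip_first_word() are illustrative stand-ins, not part of the original code; any vectorised function such as remove_first_word() could be used instead.

```r
library(dplyr)

# Small illustrative data frame (a subset of the question's columns)
data <- tibble::tibble(
  w = c("Strongly disagree", "Disagree", "Somewhat agree"),
  z = c("Very comfortable", "Comfortable", "Very uncomfortable")
)

# Hypothetical stand-in for remove_first_word().
# sub() leaves strings without a match unchanged, so single words pass through.
strip_first_word <- function(x) {
  out <- sub("^\\S+\\s+", "", x)
  # capitalise the first letter; \\U needs perl = TRUE
  sub("^([a-z])", "\\U\\1", out, perl = TRUE)
}

# .names keeps the original columns and appends w_dicho and z_dicho
result <- data |>
  mutate(across(w:z, strip_first_word, .names = "{.col}_dicho"))

names(result)
# [1] "w"       "z"       "w_dicho" "z_dicho"
result$w_dicho
# [1] "Disagree" "Disagree" "Agree"
```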
    

    tidyverse version

    In response to your comment, here is a stringr version of the original function:

    remove_first_word_tidy  <- function(x) {
        dplyr::if_else(
            stringr::str_detect(x, "\\s"),
            stringr::str_replace(x, "\\w+\\s", ""),
            x
        )  |>
        stringr::str_to_title()
    }
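
One difference from the base R version: stringr::str_to_title() capitalises every word, whereas the function in the question used stringr::str_to_sentence(). A small comparison, as a sketch; the example strings come from the question, and str_remove() is used here only as a compact way to strip the first word.

```r
library(stringr)

x <- c("Definitely not doing enough", "Comfortable")

# Strip the first word plus the whitespace after it; single words are left as-is
stripped <- str_remove(x, "^\\S+\\s+")

title_case    <- str_to_title(stripped)
sentence_case <- str_to_sentence(stripped)

title_case
# [1] "Not Doing Enough" "Comfortable"
sentence_case
# [1] "Not doing enough" "Comfortable"
```

If sentence case is what you want, swapping str_to_title() for str_to_sentence() in remove_first_word_tidy() should give results in line with the base R version.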
    

    You can also write a function that takes a data frame and a selection of columns and applies this function to each of them. Since you want to use the tidyverse, we can use tidy selection and purrr::map() to apply it to all the desired columns, producing a list of vectors:

    make_dicho  <- function(dat, cols) {
    
        out  <- dat  |>
            dplyr::select({{ cols }})  |>
            purrr::map(remove_first_word_tidy)
        
        # Return vector if only one column supplied
        if(length(out)==1) return(out[[1]])
        # Otherwise return list of vectors
        out
    }
    
    
    make_dicho(data, w) 
    # [1] "Disagree" "Disagree" "Disagree" "Agree"    "Agree"    "Agree"   
    
    make_dicho(data, y:z)
    # $y
    # [1] "Not Doing Enough" "Doing Enough"     "Not Doing Enough" "Doing Enough"     "Not Doing Enough" "Doing Enough"    
    
    # $z
    # [1] "Comfortable"   "Comfortable"   "Comfortable"   "Uncomfortable" "Uncomfortable" "Comfortable"
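
If removing the first word turns out to be too fragile (for example, if some values carry two leading modifiers), another option is to extract the keyword directly, which is closer to what the question literally asks. The helper below is a hypothetical sketch rather than part of the approach above: recode_by_keyword() and its keywords argument are names made up for illustration, and it assumes the keywords are plain text with no regex metacharacters.

```r
library(stringr)

# Hypothetical helper: return whichever keyword each string contains
recode_by_keyword <- function(x, keywords) {
  # A regex match is leftmost-first, so "disagree" already wins over the
  # "agree" inside it; sorting longest-first also covers keywords that
  # share a prefix.
  keywords <- keywords[order(nchar(keywords), decreasing = TRUE)]
  # Build a case-insensitive alternation, e.g. disagree|agree
  pattern <- regex(paste(keywords, collapse = "|"), ignore_case = TRUE)
  # Extract the first matching keyword, then capitalise its first letter
  str_to_sentence(str_extract(x, pattern))
}

agree_out <- recode_by_keyword(
  c("Strongly disagree", "Disagree", "Somewhat agree", "Agree"),
  c("agree", "disagree")
)
agree_out
# [1] "Disagree" "Disagree" "Agree"    "Agree"

enough_out <- recode_by_keyword(
  c("Definitely not doing enough", "Possibly doing enough"),
  c("doing enough", "not doing enough")
)
enough_out
# [1] "Not doing enough" "Doing enough"
```

Inside mutate() this would be called once per column, e.g. mutate(new_w = recode_by_keyword(w, c("agree", "disagree"))). Values that contain neither keyword come back as NA.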