As my question indicates, I would like to convert a vector of strings into a new vector containing whichever of two values appears in each string. Here is an example of a very simple data frame I have:
data <- tibble::tibble(
w = c("Strongly disagree", "Somewhat disagree", "Disagree", "Somewhat agree", "Strongly agree", "Agree"),
x = c("Definitely true", "Probably true", "Somewhat false", "Definitely false", "Definitely true", "Definitely false"),
y = c("Definitely not doing enough", "Definitely doing enough", "Possibly not doing enough", "Possibly doing enough", "Definitely not doing enough", "Somewhat doing enough"),
z = c("Very comfortable", "Comfortable", "Somewhat comfortable", "Very uncomfortable", "Somewhat uncomfortable", "Comfortable")
)
We can see that every string in w has either "agree" or "disagree", x has either "true" or "false", y has "doing enough" or "not doing enough", and z has either "comfortable" or "uncomfortable". Is there a function that would allow me to create a new vector based on which of the two values is present in each string of a column? Let me illustrate what I mean.
# write up a function
some_function <- function(arguments) {
"function text goes here"
}
# use new function to create a vector based on `w` from `data`
data %>% some_function(w)
# resulting vector would be:
[1] "Disagree" "Disagree" "Disagree" "Agree" "Agree" "Agree"
The closest I have gotten is with this function. However, it removes the first word of the string. This would be fine if the first word of each string was an adjective describing the rest of the string, but in the cases where the strings are just one word it gives me an NA.
# write function
make_dicho <- function(df = data, var) {
df %>%
# pick out the column (equivalent to df[[var]])
dplyr::pull({{ var }}) %>%
# convert to a factor
haven::as_factor() %>%
# remove the first part of the factor
stringr::str_extract("(?<=\\s).+") %>%
# make the first letter uppercase
stringr::str_to_sentence()
}
# test this on the fake data
data %>% make_dicho(., w)
[1] "Disagree" "Disagree" NA "Agree" "Agree" NA
The reason I have the df argument in there is that I would like to use this function inside of dplyr::mutate(), like this: data %>% mutate(new_a = make_dicho(., w)).
From your description, it sounds like you're happy with removing the first word, except when the string is only one word. We can assume there's only one word if the string contains no whitespace.
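remove_first_word <- function(x) {
  ifelse(
    # Strings containing whitespace have more than one word
    grepl("\\s", x),
    # Drop the first word and the whitespace that follows it
    sub("^\\S+\\s+", "", x),
    # Single-word strings are returned unchanged
    x
  ) |>
    # Make first letter upper case (\\U needs perl = TRUE)
    gsub("^([a-z])", "\\U\\1", x = _, perl = TRUE)
}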
Then you can use it in mutate() as desired:
data |>
mutate(
across(w:z, remove_first_word)
)
# # A tibble: 6 × 4
# w x y z
# <chr> <chr> <chr> <chr>
# 1 Disagree True Not doing enough Comfortable
# 2 Disagree True Doing enough Comfortable
# 3 Disagree False Not doing enough Comfortable
# 4 Agree False Doing enough Uncomfortable
# 5 Agree True Not doing enough Uncomfortable
# 6 Agree False Doing enough Comfortable
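Note that the `_` placeholder used with the native pipe in remove_first_word() requires R 4.2 or later. For older versions, the same idea can be written without a pipe. The following is only a sketch under that assumption; the name remove_first_word_compat is made up for illustration, and it uses a pattern that strips the leading non-whitespace run plus the whitespace after it:

```r
# Hypothetical pipe-free variant of remove_first_word() for R < 4.2
remove_first_word_compat <- function(x) {
  # Strip the first word only when the string contains whitespace;
  # single-word strings pass through unchanged
  stripped <- ifelse(grepl("\\s", x), sub("^\\S+\\s+", "", x), x)
  # Capitalise the first letter (\\U requires perl = TRUE)
  sub("^([a-z])", "\\U\\1", stripped, perl = TRUE)
}

remove_first_word_compat(
  c("Strongly disagree", "Disagree", "Definitely not doing enough")
)
# [1] "Disagree"         "Disagree"         "Not doing enough"
```

Because it takes and returns a plain character vector, it can be dropped into across() in the same way as the function above.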
tidyverse version

In response to your comment, here is a stringr version of the original function:
remove_first_word_tidy <- function(x) {
dplyr::if_else(
stringr::str_detect(x, "\\s"),
stringr::str_replace(x, "\\w+\\s", ""),
x
) |>
stringr::str_to_title()
}
You can also create a function that takes a data frame and a selection of columns and applies this function. As you want to use the tidyverse, we can use tidy-select syntax and purrr::map() to apply it to all desired columns and produce a list of vectors:
make_dicho <- function(dat, cols) {
out <- dat |>
select({{cols}}) |>
purrr::map(remove_first_word_tidy)
# Return vector if only one column supplied
if(length(out)==1) return(out[[1]])
# Otherwise return list of vectors
out
}
make_dicho(data, w)
# [1] "Disagree" "Disagree" "Disagree" "Agree" "Agree" "Agree"
make_dicho(data, y:z)
# $y
# [1] "Not Doing Enough" "Doing Enough" "Not Doing Enough" "Doing Enough" "Not Doing Enough" "Doing Enough"
# $z
# [1] "Comfortable" "Comfortable" "Comfortable" "Uncomfortable" "Uncomfortable" "Comfortable"
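If the goal is to add the result as a new column, as in the question's mutate() example, one option is to call the vector-level function directly inside mutate() rather than passing the data frame through make_dicho(). A minimal sketch, assuming dplyr, stringr and tibble are installed; the column name new_w and the shortened data are made up for illustration:

```r
library(dplyr)

# Same helper as above, repeated so the example is self-contained
remove_first_word_tidy <- function(x) {
  dplyr::if_else(
    stringr::str_detect(x, "\\s"),
    stringr::str_replace(x, "\\w+\\s", ""),
    x
  ) |>
    stringr::str_to_title()
}

# Shortened, illustrative version of the question's data
data <- tibble::tibble(
  w = c("Strongly disagree", "Disagree", "Somewhat agree", "Agree")
)

# new_w is a hypothetical column name
result <- data |>
  mutate(new_w = remove_first_word_tidy(w))

result$new_w
# [1] "Disagree" "Disagree" "Agree"    "Agree"
```

Inside mutate() the column is already available as a vector, so no data frame argument is needed for this use.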