Search code examples
r

how to strip certain patterns in column headers and rename them all in one go


I have given my students two assessments and each has about 40 items long. Each assessment was given twice, once at pre-assessment and once again at post-assessment. The vocabulary column headers has one word that describes the content of each assessment item whereas the background knowledge (identified as bk in my dataset) has up to 4 words that describe the item. To demonstrate what I mean, the following is a sample dataset. What's tricky about this dataset is that the item order for the same question changes from pre- to post assessment. So for example, the "captivity" question was the first question at pre-assessment but was the 17th question at post. Same thing with bk. Values of 0s and 1s tell me whether students that got question correct (1s) for that item or not (0s).

df <- data.frame(vocab_pre_1_captivity_key=c(0, 1),
             vocab_post_17_captivity_key=c(1,0),
             bk_pre_1_food_chains_key=c(1,0),
             bk_post_16_food_chains_key=c(0,1),
             bk_pre_6_humans_harmed_orcas_key=c(0,1),
             bk_post_15_humans_harmed_orcas_key=c(1,0))

What I would ultimately like to accomplish is to report proportion correct for each item in a table side by side. However, to do that, I need the column headers to be renamed as follows:

captivity_pre

captivity_post

food_chains_pre

food_chains_post

humans_harmed_orcas_pre

humans_harmed_orcas_post

I've been scratching my head and one way that I know how to do this is to rename each individual column using dplyr::rename. However, this would be so tedious. I'm hoping if someone could show me a much more elegant solution to this problem. This is something that I will be doing again in future studies and it would be great to learn a more sustainable solution to this problem. Thank you for your time and any assistance you can provide.


Solution

  • You can use str_remove and other functions from the {stringr} package to modify column names using the rename_with function, which allows you to rename all your columns at once. The following code gets you close:

    library(dplyr)
    library(stringr)
    
    df2 <- df |> 
      tibble() |> 
      rename_with(~str_remove(.x, "bk_")) |> 
      rename_with(~str_remove(.x, "vocab_")) |> 
      rename_with(~str_remove(.x, "\\d+_")) |> # this is regex for "any number followed by _"
      rename_with(~str_remove(.x, "_key"))
    
    > names(df2)
    [1] "pre_captivity"            "post_captivity"           "pre_food_chains"         
    [4] "post_food_chains"         "pre_humans_harmed_orcas"  "post_humans_harmed_orcas"
    

    Then, you can further use rename_with to paste "pre" and "post" at the end of names, and then removing both strings from the beginning of the column names, so that we end up with your desired result:

    df3 <- df2 |> 
      # add "pre" and "post" to the end of the column names
      rename_with(~paste0(.x, "_pre"), contains("pre")) |> 
      rename_with(~paste0(.x, "_post"), contains("post")) |> 
      # remove "pre" and "post" from the beginning of column names
      rename_with(~str_remove(.x, "pre_")) |> 
      rename_with(~str_remove(.x, "post_"))
    
    > names(df3)
    [1] "captivity_pre"            "captivity_post"           "food_chains_pre"         
    [4] "food_chains_post"         "humans_harmed_orcas_pre"  "humans_harmed_orcas_post"
    

    This method is more readable, but you can also use the "or" operator (|) inside the str_remove functions to make this code shorter, for example, using ~str_remove(.x, "pre_|post_") to delete both strings at the same time instead of using str_remove twice.