Best way to rename variables matching different naming patterns to denote time in a consistent manner?


I have a wide dataset that has psychometric measures taken from participants across various timepoints.

Time-varying labels within the psychometric measures take the form QuestionnaireTime_Item#. For example, in dass1_1, dass is the questionnaire, the first 1 is the time at which the questionnaire was administered, and the final 1 is the item number within that questionnaire.

This is mostly consistent across the questionnaires; however, one psychometric variable does not follow this nomenclature: siss1. Its naming is instead consistent with the variables denoting the date and session number of data collection, i.e., date1 and session1. As can be seen, the time label sits at the end of these variable names. There are also a number of variables that contain a numeral in the name that should not be changed, specifically cff1, cff2, etc., where the numeral denotes the item number on this measure and not time (these items are only asked once, during the datefinal collection period [see below]).

Time in the variable names is denoted by numerals in most cases (1 to 14), with the exception of the word 'final' (e.g., datefinal, sessionfinal, dassfinal_1, sissfinal) for the last session. Additionally, there are follow-up data collection periods at 6 and 12 months after the final session (the datefinal data collection period). These are denoted with 6fup or 12fup, e.g., date_6fup and dass6fup_2.

I would like to change the string denoting time to make it consistent and place it at the start of each variable name. Additionally, I would like to keep an underscore between the name of the questionnaire and the relevant item number. For example:

  • date1 -> T1.date
  • session1 -> T1.session
  • siss2 -> T2.siss
  • dass1_1 -> T1.dass_1
  • datefinal -> T15.date
  • dass6fup_2 -> T16.dass_2
  • date_12fup -> T17.date

What is the best way to do this given that the numerical value denoting the time changes and is inconsistent?

Currently, I have the below which was provided here:

names(old_sp_wide) <- sub("([a-z]+)(\\d+)(_\\d+)?", "T\\2.\\1\\3",
                          sub("final", "15", names(old_sp_wide)),
                          ignore.case = TRUE
                          )

However, this also changes the name for the variables with the cff prefix, and does not work as expected on the variables with the time label 6fup and 12fup.
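
To make the two failure modes concrete, here is a small check of that same call on three of the names from the reprex below. The vector x and the result name current are only illustrative stand-ins for names(old_sp_wide).

```r
# Illustrative subset of the reprex names (x is a made-up variable name)
x <- c("cff1", "dass6fup_2", "date_6fup")

# Same call as above, applied to x instead of names(old_sp_wide)
current <- sub("([a-z]+)(\\d+)(_\\d+)?", "T\\2.\\1\\3",
               sub("final", "15", x),
               ignore.case = TRUE)

# Pair each old name with what the current code produces
stats::setNames(current, x)
```

cff1 comes back as T1.cff because nothing in the pattern protects that prefix. dass6fup_2 becomes T6.dassfup_2, since the match stops after the 6 and the fup text is left behind. date_6fup is not touched at all, because the underscore sits between the letters and the digit.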

Is there a way to do this with stringr or stringi?

Please see below for a reproducible example.

structure(list(uci = 12345L, dob = structure(1L, .Label = "1988_01_26", class = "factor"),
               sex = 2L, sp_episode = 1L, staff = structure(1L, .Label = "aj", class = "factor"),
               YP_consent = 1L, date1 = structure(1L, .Label = "2016_10_03", class = "factor"),
               session1 = 1L, dass1_1 = 3L, dass1_2 = 0L, dass1_3 = 2L,
               siss1 = 1L, diag1 = NA, diag2 = NA, diag3 = NA, pastpsyc = NA,
               pastmed = NA, date2 = structure(1L, .Label = "2016_10_15", class = "factor"),
               session2 = 3L, dass2_1 = 3L, dass2_2 = 0L, dass2_3 = 2L,
               siss2 = NA, datefinal = structure(1L, .Label = "2016_11_12", class = "factor"),
               sessionfinal = 8L, dassfinal_1 = 2L, dassfinal_2 = 1L, dassfinal_3 = 2L,
               dassfinal_4 = 3L, sissfinal = NA, cff1 = NA, cff2 = NA, cff3 = NA,
               date_6fup = structure(1L, .Label = "2014_06_30", class = "factor"),
               dass6fup_2 = 3L, dass6fup_3 = 1L, dass6fup_4 = 1L, siss6fup = 2L,
               date_12fup = NA), class = "data.frame", row.names = c(NA,
                                                                     -1L))

Solution

  • Thank you for the reprex and the thorough explanation of your problem. If I understood correctly, the following routine should give you what you're after or, failing that, hopefully get you pretty close.

    I've used two rounds of stringr::str_replace_all. In the first round, we replace all final, 6fup, and 12fup suffixes with their indicated numerical equivalents (i.e. 15, 16, 17). In round two, we target the remaining two main regex patterns, making sure to exclude any matches that start with the cff prefix.

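    # load magrittr for the %>% pipe
    library(magrittr)

    # df is the data frame from the reprex (called old_sp_wide in the question)
    # create new_names by applying two rounds of str_replace_all to the old names
    new_names <- names(df) %>%
      # round 1: recode the non-numeric time labels as session numbers
      stringr::str_replace_all(c(
        'final' = '15',
        '_6fup|6fup' = '16',
        '_12fup|12fup' = '17'
      )) %>%
      # round 2: move the time index to the front, skipping the cff items
      # ([A-Za-z] rather than [A-z], which also matches [ \ ] ^ _ and `)
      stringr::str_replace_all(
        c(
          '^(?!cff\\d)([A-Za-z]+)(\\d{1,2})$' = 'T\\2.\\1',
          '^(?!cff\\d)([A-Za-z]+)(\\d{1,2})_(\\d)' = 'T\\2.\\1_\\3'
        )
      )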
    
    # compare old names to new names
    new_names %>% purrr::set_names(names(df))
    #>           uci           dob           sex    sp_episode         staff 
    #>         "uci"         "dob"         "sex"  "sp_episode"       "staff" 
    #>    YP_consent         date1      session1       dass1_1       dass1_2 
    #>  "YP_consent"     "T1.date"  "T1.session"   "T1.dass_1"   "T1.dass_2" 
    #>       dass1_3         siss1         diag1         diag2         diag3 
    #>   "T1.dass_3"     "T1.siss"     "T1.diag"     "T2.diag"     "T3.diag" 
    #>      pastpsyc       pastmed         date2      session2       dass2_1 
    #>    "pastpsyc"     "pastmed"     "T2.date"  "T2.session"   "T2.dass_1" 
    #>       dass2_2       dass2_3         siss2     datefinal  sessionfinal 
    #>   "T2.dass_2"   "T2.dass_3"     "T2.siss"    "T15.date" "T15.session" 
    #>   dassfinal_1   dassfinal_2   dassfinal_3   dassfinal_4     sissfinal 
    #>  "T15.dass_1"  "T15.dass_2"  "T15.dass_3"  "T15.dass_4"    "T15.siss" 
    #>          cff1          cff2          cff3     date_6fup    dass6fup_2 
    #>        "cff1"        "cff2"        "cff3"    "T16.date"  "T16.dass_2" 
    #>    dass6fup_3    dass6fup_4      siss6fup    date_12fup 
    #>  "T16.dass_3"  "T16.dass_4"    "T16.siss"    "T17.date"
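
If you'd rather have the two rounds wrapped up in one reusable function, here is a minimal base-R sketch of the same idea. The function name rename_time_vars and its keep argument are made up for illustration. keep lists the prefixes whose trailing digits are item numbers rather than time points (cff is the only one named in the question). The sketch assumes every time-stamped name has the shape letters, then time, then an optional _item, as in the reprex.

```r
# Hypothetical helper: both rounds in one function, base R only
rename_time_vars <- function(x, keep = c("cff")) {
  # Round 1: recode the non-numeric time labels as session numbers
  x <- sub("final", "15", x, fixed = TRUE)
  x <- sub("_?12fup", "17", x)
  x <- sub("_?6fup", "16", x)

  # Round 2: build a negative lookahead that protects the prefixes in keep,
  # then move the time index to the front of the name
  guard <- sprintf("^(?!(?:%s)\\d)", paste(keep, collapse = "|"))
  pattern <- paste0(guard, "([A-Za-z]+)(\\d{1,2})(_\\d+)?$")
  sub(pattern, "T\\2.\\1\\3", x, perl = TRUE)
}

# A few names taken from the reprex
x <- c("date1", "dass1_1", "datefinal", "dass6fup_2", "date_12fup", "cff1", "diag1")

rename_time_vars(x)
# "T1.date" "T1.dass_1" "T15.date" "T16.dass_2" "T17.date" "cff1" "T1.diag"

# Protect diag as well, in case its digits are item numbers and not time points
rename_time_vars(x, keep = c("cff", "diag"))
# "T1.date" "T1.dass_1" "T15.date" "T16.dass_2" "T17.date" "cff1" "diag1"
```

To apply it to the data, names(old_sp_wide) <- rename_time_vars(names(old_sp_wide)) should do. Note that in the output above diag1 to diag3 get renamed too; if those numerals are not time points, adding "diag" to keep leaves them alone.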