I have a wide dataset that has psychometric measures taken from participants across various timepoints.
Time-varying labels within the psychometric measures are in the form QuestionnaireTime_Item#.
An example is dass1_1, where dass = Questionnaire, 1_ = Time_ (when the questionnaire was administered), and the final 1 = Item# of the relevant questionnaire.
This is mostly consistent across the questionnaires; however, one psychometric variable does not follow this nomenclature: siss1. Its naming is, however, consistent with the other variables denoting the date and session number of data collection, i.e., date1 and session1.
As can be seen, the time labels for these variables are at the end of the variable names.
However, a number of variables contain a numeral in the name that should not be changed, specifically cff1, cff2, etc., where the numeral denotes the item number on this measure and not time (they are only asked once, during the datefinal collection period [see below]).
Time in the variable names is denoted by numerals in most cases (1-14), with the exception of the word 'final' (e.g., datefinal, sessionfinal, dassfinal_1, sissfinal) for the last session.
Additionally, there are data collection periods that took place 6 and 12 months after the final session (the datefinal data collection period).
These are denoted with 6fup or 12fup, e.g., date_6fup and dass6fup_2.
I would like to change the string denoting the time so that it is consistent and sits at the start of each variable name. Additionally, I would like to have an underscore between the name of the questionnaire and the relevant item number. For example:
date1 -> T1.date
session1 -> T1.session
siss2 -> T2.siss
dass1_1 -> T1.dass_1
datefinal -> T15.date
dass_6fup_2 -> T16.dass_2
date_12fup -> T17.date
What is the best way to do this given that the numerical value denoting the time changes and is inconsistent?
Currently, I have the below which was provided here:
names(old_sp_wide) <- sub(
  "([a-z]+)(\\d+)(_\\d+)?", "T\\2.\\1\\3",
  sub("final", "15", names(old_sp_wide)),
  ignore.case = TRUE
)
However, this also changes the names of the variables with the cff prefix, and does not work as expected on the variables with the time labels 6fup and 12fup.
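To illustrate both problems, here is a minimal check of the call above on four names taken from the data below (base R only; the vector x and the object res are just illustrative names):

```r
# Four names from the example data: one that works, one cff item,
# and two follow-up variables
x <- c("dass1_1", "cff1", "dass6fup_2", "date_6fup")

res <- sub(
  "([a-z]+)(\\d+)(_\\d+)?", "T\\2.\\1\\3",
  sub("final", "15", x),
  ignore.case = TRUE
)
res
#> [1] "T1.dass_1"    "T1.cff"       "T6.dassfup_2" "date_6fup"
```

So cff1 is treated as if its numeral were a time point, the 6 in dass6fup_2 is picked up as the time while fup is left behind, and date_6fup is not matched at all because of the underscore before the numeral.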
Is there a way to do this with stringr or stringi?
Please see below for a reproducible example.
structure(list(uci = 12345L, dob = structure(1L, .Label = "1988_01_26", class = "factor"),
sex = 2L, sp_episode = 1L, staff = structure(1L, .Label = "aj", class = "factor"),
YP_consent = 1L, date1 = structure(1L, .Label = "2016_10_03", class = "factor"),
session1 = 1L, dass1_1 = 3L, dass1_2 = 0L, dass1_3 = 2L,
siss1 = 1L, diag1 = NA, diag2 = NA, diag3 = NA, pastpsyc = NA,
pastmed = NA, date2 = structure(1L, .Label = "2016_10_15", class = "factor"),
session2 = 3L, dass2_1 = 3L, dass2_2 = 0L, dass2_3 = 2L,
siss2 = NA, datefinal = structure(1L, .Label = "2016_11_12", class = "factor"),
sessionfinal = 8L, dassfinal_1 = 2L, dassfinal_2 = 1L, dassfinal_3 = 2L,
dassfinal_4 = 3L, sissfinal = NA, cff1 = NA, cff2 = NA, cff3 = NA,
date_6fup = structure(1L, .Label = "2014_06_30", class = "factor"),
dass6fup_2 = 3L, dass6fup_3 = 1L, dass6fup_4 = 1L, siss6fup = 2L,
date_12fup = NA), class = "data.frame", row.names = c(NA,
-1L))
Thank you for the reprex and the thorough explanation of your problem. If I understood correctly, the following routine should give you what you're after or, failing that, hopefully get you pretty close.
I've used two rounds of stringr::str_replace_all. In the first round, we replace all final, 6fup, and 12fup suffixes with their indicated numerical equivalents (i.e. 15, 16, 17). In round two, we target the remaining two main regex patterns, making sure to exclude any matches that start with the cff prefix.
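As a quick sanity check, round one can be run on its own against a handful of the names from the question (a small sketch; x and round_one are just illustrative names):

```r
library(stringr)

# A few names from the question's data that carry word-based time labels
x <- c("datefinal", "dassfinal_1", "date_6fup", "dass6fup_2", "date_12fup")

# A named vector applies each pattern = replacement pair in turn
round_one <- str_replace_all(x, c(
  "final"        = "15",
  "_6fup|6fup"   = "16",
  "_12fup|12fup" = "17"
))
round_one
#> [1] "date15"   "dass15_1" "date16"   "dass16_2" "date17"
```

After this step every time label is a plain numeral directly after the questionnaire name, which is what the round-two patterns rely on.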
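library(magrittr)  # provides the %>% pipe

# df is the data frame from the question's dput() output
# create new_names by applying two rounds of str_replace_all to the old names
new_names <- names(df) %>%
  # round one: turn the word-based time labels into numerals
  stringr::str_replace_all(c(
    'final' = '15',
    '_6fup|6fup' = '16',
    '_12fup|12fup' = '17'
  )) %>%
  # round two: move the time numeral to the front, skipping the cff item variables
  # ([A-Za-z] rather than [A-z], which would also match characters such as _ and ^)
  stringr::str_replace_all(
    c(
      '^(?!cff\\d)([A-Za-z]+)(\\d{1,2})$' = 'T\\2.\\1',
      '^(?!cff\\d)([A-Za-z]+)(\\d{1,2})_(\\d)' = 'T\\2.\\1_\\3'
    )
  )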
# compare old names to new names
new_names %>% purrr::set_names(names(df))
#>           uci           dob           sex    sp_episode         staff
#>         "uci"         "dob"         "sex"  "sp_episode"       "staff"
#>    YP_consent         date1      session1       dass1_1       dass1_2
#>  "YP_consent"     "T1.date"  "T1.session"   "T1.dass_1"   "T1.dass_2"
#>       dass1_3         siss1         diag1         diag2         diag3
#>   "T1.dass_3"     "T1.siss"     "T1.diag"     "T2.diag"     "T3.diag"
#>      pastpsyc       pastmed         date2      session2       dass2_1
#>    "pastpsyc"     "pastmed"     "T2.date"  "T2.session"   "T2.dass_1"
#>       dass2_2       dass2_3         siss2     datefinal  sessionfinal
#>   "T2.dass_2"   "T2.dass_3"     "T2.siss"    "T15.date" "T15.session"
#>   dassfinal_1   dassfinal_2   dassfinal_3   dassfinal_4     sissfinal
#>  "T15.dass_1"  "T15.dass_2"  "T15.dass_3"  "T15.dass_4"    "T15.siss"
#>          cff1          cff2          cff3     date_6fup    dass6fup_2
#>        "cff1"        "cff2"        "cff3"    "T16.date"  "T16.dass_2"
#>    dass6fup_3    dass6fup_4      siss6fup    date_12fup
#>  "T16.dass_3"  "T16.dass_4"    "T16.siss"    "T17.date"
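Note that in the output above diag1, diag2 and diag3 are also renamed (to T1.diag, T2.diag and T3.diag). The question doesn't say whether those numerals denote time, so this is an assumption on my part: if they are item numbers like the cff ones, the exclusion can be widened. Below is a rough sketch that wraps the same two rounds in a helper; rename_time_vars and its item_prefixes argument are hypothetical names I made up for illustration, and "_?6fup" is just a shorter way of writing "_6fup|6fup".

```r
library(stringr)

# Hypothetical helper (not part of stringr): the same two rounds as above,
# with the prefixes whose trailing numeral is an item number as a parameter.
rename_time_vars <- function(nms, item_prefixes = c("cff", "diag")) {
  # negative lookahead, e.g. ^(?!(?:cff|diag)\d), skips the protected prefixes
  guard <- paste0("^(?!(?:", paste(item_prefixes, collapse = "|"), ")\\d)")

  # round one: word-based time labels become numerals
  nms <- str_replace_all(nms, c(
    "final"   = "15",
    "_?6fup"  = "16",
    "_?12fup" = "17"
  ))

  # round two: move the time numeral to the front as a T prefix
  str_replace_all(nms, setNames(
    c("T\\2.\\1", "T\\2.\\1_\\3"),
    c(paste0(guard, "([A-Za-z]+)(\\d{1,2})$"),
      paste0(guard, "([A-Za-z]+)(\\d{1,2})_(\\d+)$"))
  ))
}

# Toy data frame with a few of the question's column names
toy <- data.frame(date1 = "2016_10_03", dass1_1 = 3L, diag2 = NA,
                  cff3 = NA, datefinal = "2016_11_12", dass6fup_2 = 3L)

# Applying the new names is a plain assignment
names(toy) <- rename_time_vars(names(toy))
names(toy)
#> [1] "T1.date"    "T1.dass_1"  "diag2"      "cff3"       "T15.date"   "T16.dass_2"
```

On the real data the last step would be names(df) <- rename_time_vars(names(df)), or simply names(df) <- new_names if the original behaviour for the diag variables is what you want.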