In R, I have a dataframe with the answers to a questionnaire, and due to technical reasons the data could only be provided as the answer labels, so not the answer codes.
The answers can be Never
, Sometimes
, Always
which corresponds with the answer codes 0
, 1
, 3
, and these codes can be used to calculate a total score.
I've isolated a test dataset (the full data has more columns) and the R syntax I've got so far.
df <- read.table(text = "
DateOfBirth,Sex,Question_01,Question_02,Question_03,Question_04
12-03-2001,M,Sometimes,Sometimes,Never,Never
21-08-1988,F,Always,Never,Always,Sometimes
30-11-1985,F,Always,Always,Sometimes,Never
01-04-2001,M,Sometimes,Sometimes,Never,Never
12-12-1986,M,Always,Sometimes,Always,Never
19-02-1982,F,Always,Always,Never,Sometimes
11-05-1980,M,Sometimes,Sometimes,Never,Never
01-06-2000,F,Always,Sometimes,Always,Always
20-11-1981,F,Never,Never,Never,Sometimes
30-07-1982,M,Never,Never,Sometimes,Never
", header = TRUE, sep = ",", na.strings = "")
# Levels F, M (alphabetically sorted -> 1, 2)
df$Sex_fact <- factor(df$Sex)
lookup_freq <- c(
"0" = "Never",
"1" = "Sometimes",
"3" = "Always"
)
df$Q1 <- factor(df$Question_01, levels = lookup_freq, labels = names(lookup_freq))
df$Q2 <- factor(df$Question_02, levels = lookup_freq, labels = names(lookup_freq))
df$Q3 <- factor(df$Question_03, levels = lookup_freq, labels = names(lookup_freq))
df$Q4 <- factor(df$Question_04, levels = lookup_freq, labels = names(lookup_freq))
df$Total_score <- as.numeric(df$Q1) + as.numeric(df$Q2) + as.numeric(df$Q3) + as.numeric(df$Q4)
print(df)
The Total_score
should just be all the answers added up, according to the 0
, 1
, 3
codes. The result I get is incorrect, and I understand that this is because it uses the numeric values of the factor, which is just 1
, 2
, 3
. But I don't know how to set the levels to the desired codes, using the list I've got in lookup_freq
.
DateOfBirth Sex Question_01 Question_02 Question_03 Question_04 Sex_fact Q1 Q2 Q3 Q4 Total_score
1 12-03-2001 M Sometimes Sometimes Never Never M 1 1 0 0 6
2 21-08-1988 F Always Never Always Sometimes F 3 0 3 1 9
3 30-11-1985 F Always Always Sometimes Never F 3 3 1 0 9
4 01-04-2001 M Sometimes Sometimes Never Never M 1 1 0 0 6
5 12-12-1986 M Always Sometimes Always Never M 3 1 3 0 9
6 19-02-1982 F Always Always Never Sometimes F 3 3 0 1 9
7 11-05-1980 M Sometimes Sometimes Never Never M 1 1 0 0 6
8 01-06-2000 F Always Sometimes Always Always F 3 1 3 3 11
9 20-11-1981 F Never Never Never Sometimes F 0 0 0 1 5
10 30-07-1982 M Never Never Sometimes Never M 0 0 1 0 5
So my question is, is it possible to create a Factor column with the labels Never
, Sometimes
, Always
but with the underlying levels as 0
, 1
, 3
? Or is it better to just create new columns and replace the character label values with the integer score values?
Or what is the best way to use a lookup variable, similar to lookup_freq
? Because I've got more questionnaires with similarly coded answer columns, sometimes with 4 or 5 answer possibilities.
A hack, but you can create factors such that they resolve to 1, 2, and 4 (instead of 0, 1, and 3) and subtract 1 for each column you're reference.
### ...
lookup_freq <- c("Never", "Sometimes", "Who Cares", "Always")
as.numeric(factor(c("Never", "Sometimes", "Always"), levels=lookup_freq)) - 1
# [1] 0 1 3
df$Q1 <- factor(df$Question_01, levels = lookup_freq)
df$Q2 <- factor(df$Question_02, levels = lookup_freq)
df$Q3 <- factor(df$Question_03, levels = lookup_freq)
df$Q4 <- factor(df$Question_04, levels = lookup_freq)
(as.numeric(df$Q1) + as.numeric(df$Q2) + as.numeric(df$Q3) + as.numeric(df$Q4)) - 4
# [1] 2 7 7 2 7 7 2 10 1 1
# or more clearly
rowSums(sapply(df[,c("Q1","Q2","Q3","Q4")], as.numeric)-1)
# [1] 2 7 7 2 7 7 2 10 1 1
# original `df` with no factors
lookup_dict <- c("Never"=0, "Sometimes"=1, "Always"=3)
rowSums(sapply(df[,3:6], function(i) lookup_dict[i]))
# Sometimes Always Always Sometimes Always Always Sometimes Always Never Never
# 2 7 7 2 7 7 2 10 1 1