I have a dataframe with a set of answers to a questionnaire
DAILY_QUESTIONS <- c("Q1_Daily", "Q2_Daily", "Q3_Daily")
as well as the respondent's sex. Here's a toy dataset that illustrates my data:
UserId Days_From_First_Use Sex Q1_Daily Q2_Daily Q3_Daily Region
1 A 0 Female 3 2 1 US
2 A 1 Female 1 0 0 US
3 A 2 Female 1 1 0 US
4 A 3 Female 0 2 0 US
5 A 4 Female 1 1 1 US
6 B 0 Male 4 8 2 CA
7 B 2 Male 2 2 1 CA
8 B 4 Male 5 6 5 CA
9 B 5 Male 4 5 5 CA
10 C 0 Female 5 7 2 UK
11 C 1 Female 2 2 2 UK
12 C 2 Female 5 5 4 UK
13 C 3 Female 6 5 3 UK
14 C 4 Female 6 6 4 UK
15 D 0 Male 5 3 5 US
16 D 1 Male 5 3 4 US
17 D 2 Male 4 2 6 US
18 D 3 Male 0 0 1 US
19 D 4 Male 1 1 1 US
I create a tabular summary of this using the command
tabular((Sex + 1) ~ (n=1) + Format(digits=2)*(Q1_Daily + Q2_Daily + Q3_Daily)*(mean + sd), df_tmp)
and get the correct output:
Q1_Daily Q2_Daily Q3_Daily
Sex n mean sd mean sd mean sd
Female 10 3.0 2.3 3.1 2.4 1.7 1.6
Male 9 3.3 1.9 3.3 2.5 3.3 2.1
All 19 3.2 2.1 3.2 2.4 2.5 2.0
I now have four questions:
In my real dataframe, I have not 2, but 21 different questions, and they don't all take the form of Q_Daily. How can I avoid writing a string with all 21 questions? I have collected them into a vector using my_questions <- c("Q1_Daily", "Q2_Daily", ...., "Grand_Total")
. Is there a way to feed my_questions to the tabular expresssion
How can I transpose the table so that the column headers are Female and Male, and each question is associated with two rows. Typing tranpose(tabular(..)) or dcast(melt(tablular(...))) gets me gibberish and an error message respectively
My real dataframe is much larger (and I have only shown a few lines and 3 questios)
df_test
UserId Sex Days_From_First_Use Q1_Daily Q2_Daily Q3_Daily 1 45ae06dd-37eb-492c-9a57-ca75d29d91f1 Female 0 days 3 2 1 2 45ae06dd-37eb-492c-9a57-ca75d29d91f1 Female 1 days 1 0 0 3 45ae06dd-37eb-492c-9a57-ca75d29d91f1 Female 2 days 1 1 0 4 45ae06dd-37eb-492c-9a57-ca75d29d91f1 Female 3 days 0 2 0 5 45ae06dd-37eb-492c-9a57-ca75d29d91f1 Female 4 days 1 1 1 6 45ae06dd-37eb-492c-9a57-ca75d29d91f1 Female 5 days 2 3 0 7 45ae06dd-37eb-492c-9a57-ca75d29d91f1 Female 6 days 1 1 0 8 45ae06dd-37eb-492c-9a57-ca75d29d91f1 Female 16 days 0 1 0 9 45ae06dd-37eb-492c-9a57-ca75d29d91f1 Female 17 days 0 0 0 10 45ae06dd-37eb-492c-9a57-ca75d29d91f1 Female 18 days 0 0 0
But if I try and use the same tabular function on it, I get an error message:
tabular((Sex + 1) ~ (n=1) + Format(digits=2)*(Q1_Daily + Q2_Daily )*(mean + sd), as.data.frame(df_test))
Error in term2table(rows[[i]], cols[[j]], data, n) :
Duplicate values: Sex and Q1_Daily
How do i fix this?
Apologies for this long question and many thanks in advance for your help
Thomas Philips
I would suggest you to use an approach based on dplyr
as it can be easy to manage a large number of questions. Your names will change but results will be the same. Here the code:
library(dplyr)
#Functions
total <- function(x) {sum(!is.na(x))}
stats <- c('mean', 'sd', 'total')
#Code
new <- df %>% select(Sex,contains('Daily')) %>%
group_by(Sex) %>% summarise_all(.funs = stats) %>%
bind_rows(
df %>% select(contains('Daily')) %>%
summarise_all(.funs = stats) %>%
mutate(Sex='Total')
) %>%
rename(Myvar=Sex) %>%
select(sort(current_vars()))
Output:
# A tibble: 3 x 10
Myvar Q1_Daily_mean Q1_Daily_sd Q1_Daily_total Q2_Daily_mean Q2_Daily_sd
<chr> <dbl> <dbl> <int> <dbl> <dbl>
1 Fema~ 3 2.31 10 3.1 2.42
2 Male 3.33 1.87 9 3.33 2.55
3 Total 3.16 2.06 19 3.21 2.42
# ... with 4 more variables: Q2_Daily_total <int>, Q3_Daily_mean <dbl>,
# Q3_Daily_sd <dbl>, Q3_Daily_total <int>
Some data used:
#Data
df <- structure(list(UserId = c("A", "A", "A", "A", "A", "B", "B",
"B", "B", "C", "C", "C", "C", "C", "D", "D", "D", "D", "D"),
Days_From_First_Use = c(0L, 1L, 2L, 3L, 4L, 0L, 2L, 4L, 5L,
0L, 1L, 2L, 3L, 4L, 0L, 1L, 2L, 3L, 4L), Sex = c("Female",
"Female", "Female", "Female", "Female", "Male", "Male", "Male",
"Male", "Female", "Female", "Female", "Female", "Female",
"Male", "Male", "Male", "Male", "Male"), Q1_Daily = c(3L,
1L, 1L, 0L, 1L, 4L, 2L, 5L, 4L, 5L, 2L, 5L, 6L, 6L, 5L, 5L,
4L, 0L, 1L), Q2_Daily = c(2L, 0L, 1L, 2L, 1L, 8L, 2L, 6L,
5L, 7L, 2L, 5L, 5L, 6L, 3L, 3L, 2L, 0L, 1L), Q3_Daily = c(1L,
0L, 0L, 0L, 1L, 2L, 1L, 5L, 5L, 2L, 2L, 4L, 3L, 4L, 5L, 4L,
6L, 1L, 1L), Region = c("US", "US", "US", "US", "US", "CA",
"CA", "CA", "CA", "UK", "UK", "UK", "UK", "UK", "US", "US",
"US", "US", "US")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19"))