Search code examples
rdplyrtabular

4 Tabular questions including error with duplicates R


I have a dataframe with a set of answers to a questionnaire

DAILY_QUESTIONS  <- c("Q1_Daily",  "Q2_Daily",   "Q3_Daily")

as well as the respondent's sex. Here's a toy dataset that illustrates my data:

       UserId Days_From_First_Use    Sex Q1_Daily Q2_Daily Q3_Daily Region
1       A                   0 Female        3        2        1     US
2       A                   1 Female        1        0        0     US
3       A                   2 Female        1        1        0     US
4       A                   3 Female        0        2        0     US
5       A                   4 Female        1        1        1     US
6       B                   0   Male        4        8        2     CA
7       B                   2   Male        2        2        1     CA
8       B                   4   Male        5        6        5     CA
9       B                   5   Male        4        5        5     CA
10      C                   0 Female        5        7        2     UK
11      C                   1 Female        2        2        2     UK
12      C                   2 Female        5        5        4     UK
13      C                   3 Female        6        5        3     UK
14      C                   4 Female        6        6        4     UK
15      D                   0   Male        5        3        5     US
16      D                   1   Male        5        3        4     US
17      D                   2   Male        4        2        6     US
18      D                   3   Male        0        0        1     US
19      D                   4   Male        1        1        1     US

I create a tabular summary of this using the command

 tabular((Sex + 1) ~ (n=1) + Format(digits=2)*(Q1_Daily + Q2_Daily + Q3_Daily)*(mean + sd), df_tmp)

and get the correct output:

               Q1_Daily     Q2_Daily     Q3_Daily    
 Sex    n  mean     sd  mean     sd  mean     sd 
 Female 10 3.0      2.3 3.1      2.4 1.7      1.6
 Male    9 3.3      1.9 3.3      2.5 3.3      2.1
 All    19 3.2      2.1 3.2      2.4 2.5      2.0

I now have four questions:

  1. In my real dataframe, I have not 2, but 21 different questions, and they don't all take the form of Q_Daily. How can I avoid writing a string with all 21 questions? I have collected them into a vector using my_questions <- c("Q1_Daily", "Q2_Daily", ...., "Grand_Total"). Is there a way to feed my_questions to the tabular expresssion

  2. How can I transpose the table so that the column headers are Female and Male, and each question is associated with two rows. Typing tranpose(tabular(..)) or dcast(melt(tablular(...))) gets me gibberish and an error message respectively

  3. My real dataframe is much larger (and I have only shown a few lines and 3 questios)

    df_test

A tibble: 2,418 x 6

UserId Sex Days_From_First_Use Q1_Daily Q2_Daily Q3_Daily 1 45ae06dd-37eb-492c-9a57-ca75d29d91f1 Female 0 days 3 2 1 2 45ae06dd-37eb-492c-9a57-ca75d29d91f1 Female 1 days 1 0 0 3 45ae06dd-37eb-492c-9a57-ca75d29d91f1 Female 2 days 1 1 0 4 45ae06dd-37eb-492c-9a57-ca75d29d91f1 Female 3 days 0 2 0 5 45ae06dd-37eb-492c-9a57-ca75d29d91f1 Female 4 days 1 1 1 6 45ae06dd-37eb-492c-9a57-ca75d29d91f1 Female 5 days 2 3 0 7 45ae06dd-37eb-492c-9a57-ca75d29d91f1 Female 6 days 1 1 0 8 45ae06dd-37eb-492c-9a57-ca75d29d91f1 Female 16 days 0 1 0 9 45ae06dd-37eb-492c-9a57-ca75d29d91f1 Female 17 days 0 0 0 10 45ae06dd-37eb-492c-9a57-ca75d29d91f1 Female 18 days 0 0 0

... with 2,408 more rows

But if I try and use the same tabular function on it, I get an error message:

tabular((Sex + 1) ~ (n=1) + Format(digits=2)*(Q1_Daily + Q2_Daily )*(mean + sd), as.data.frame(df_test))
Error in term2table(rows[[i]], cols[[j]], data, n) : 
  Duplicate values: Sex and Q1_Daily

How do i fix this?

  1. Finally, how can I write my tabular output to an Excel workbook using openxlsx, whose writeData function seems to works orly with dataframes? Trying to coerce the tabular object to a dataframe using as.data.frame does not work.

Apologies for this long question and many thanks in advance for your help

Thomas Philips


Solution

  • I would suggest you to use an approach based on dplyr as it can be easy to manage a large number of questions. Your names will change but results will be the same. Here the code:

    library(dplyr)
    #Functions
    total <- function(x) {sum(!is.na(x))}
    stats <- c('mean', 'sd', 'total')
    #Code
    new <- df %>% select(Sex,contains('Daily')) %>%
      group_by(Sex) %>% summarise_all(.funs = stats) %>%
      bind_rows(
        df %>% select(contains('Daily')) %>%
          summarise_all(.funs = stats) %>%
          mutate(Sex='Total')
      ) %>%
      rename(Myvar=Sex) %>%
      select(sort(current_vars()))
    

    Output:

    # A tibble: 3 x 10
      Myvar Q1_Daily_mean Q1_Daily_sd Q1_Daily_total Q2_Daily_mean Q2_Daily_sd
      <chr>         <dbl>       <dbl>          <int>         <dbl>       <dbl>
    1 Fema~          3           2.31             10          3.1         2.42
    2 Male           3.33        1.87              9          3.33        2.55
    3 Total          3.16        2.06             19          3.21        2.42
    # ... with 4 more variables: Q2_Daily_total <int>, Q3_Daily_mean <dbl>,
    #   Q3_Daily_sd <dbl>, Q3_Daily_total <int>
    

    Some data used:

    #Data
    df <- structure(list(UserId = c("A", "A", "A", "A", "A", "B", "B", 
    "B", "B", "C", "C", "C", "C", "C", "D", "D", "D", "D", "D"), 
        Days_From_First_Use = c(0L, 1L, 2L, 3L, 4L, 0L, 2L, 4L, 5L, 
        0L, 1L, 2L, 3L, 4L, 0L, 1L, 2L, 3L, 4L), Sex = c("Female", 
        "Female", "Female", "Female", "Female", "Male", "Male", "Male", 
        "Male", "Female", "Female", "Female", "Female", "Female", 
        "Male", "Male", "Male", "Male", "Male"), Q1_Daily = c(3L, 
        1L, 1L, 0L, 1L, 4L, 2L, 5L, 4L, 5L, 2L, 5L, 6L, 6L, 5L, 5L, 
        4L, 0L, 1L), Q2_Daily = c(2L, 0L, 1L, 2L, 1L, 8L, 2L, 6L, 
        5L, 7L, 2L, 5L, 5L, 6L, 3L, 3L, 2L, 0L, 1L), Q3_Daily = c(1L, 
        0L, 0L, 0L, 1L, 2L, 1L, 5L, 5L, 2L, 2L, 4L, 3L, 4L, 5L, 4L, 
        6L, 1L, 1L), Region = c("US", "US", "US", "US", "US", "CA", 
        "CA", "CA", "CA", "UK", "UK", "UK", "UK", "UK", "US", "US", 
        "US", "US", "US")), class = "data.frame", row.names = c("1", 
    "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
    "14", "15", "16", "17", "18", "19"))