A subtle problem with t-tests over multiple columns

I have a data frame with responses to multiple questions (a reproducible example with 2 questions is below):

library(dplyr)   # %>%, group_by(), mutate(), ...
library(tidyr)   # gather(), spread()

set.seed(1)
df <- data.frame (
          UserId = c(rep("A", 4), rep("B", 4), rep("C", 4), rep("D", 4)),
          Sex = c(rep("Female", 8), rep("Male", 4), rep("No_Response", 4)),
          Answer_Date = as.Date(c("1990-01-01", "1990-02-01", "1990-03-01", "1990-04-01",
                                  "1991-02-01", "1991-03-01", "1991-04-01", "1991-05-01",
                                  "1992-03-01", "1992-04-01", "1992-05-01", "1992-06-01",
                                  "1993-07-10", "1992-08-10", "1993-09-10", "1993-10-10")),
          Q1 = sample(1:10, 16, replace = TRUE),
          Q2 = sample(1:10, 16, replace = TRUE)
      ) %>%
      group_by(UserId) %>%
      mutate(First_Answer_Date = min(Answer_Date)) %>%
      mutate(Last_Answer_Date  = max(Answer_Date)) %>%
      ungroup()

Following the suggestion in

https://sebastiansauer.github.io/multiple-t-tests-with-dplyr/

I run t-tests for Q1 and Q2 against the null hypothesis that the true mean is 0:

questions <- c("Q1", "Q2")
df %>%
  select(all_of(questions), Sex) %>%
  filter(Sex != "No_Response") %>%
  gather(key = variable, value = value, -Sex) %>%
  group_by(Sex, variable) %>%
  summarize(value = list(value)) %>%
  spread(Sex, value) %>%
  group_by(variable) %>%
  mutate( p_Female = t.test(unlist(Female))$p.value,
          p_Male   = t.test(unlist(Male)  )$p.value,
          t_Female = t.test(unlist(Female))$statistic,
          t_Male   = t.test(unlist(Male)  )$statistic) %>%
  mutate( Female = length(unlist(Female)),
          Male   = length(unlist(Male))
  )

which gives me

# A tibble: 2 x 7
# Groups:   variable [2]
  variable Female  Male  p_Female  p_Male t_Female t_Male
  <chr>     <int> <int>     <dbl>   <dbl>    <dbl>  <dbl>
1 Q1            8     4 0.0000501 0.00137     8.78  11.6 
2 Q2            8     4 0.00217   0.0115      4.71   5.55

All good so far. My troubles start when I want to do the t-test only on the First_Answer_Date.

df %>%
  filter(Answer_Date == First_Answer_Date) %>%
  select(all_of(questions), Sex) %>%
  filter(Sex != "No_Response")

    # A tibble: 3 x 3
         Q1    Q2 Sex   
      <int> <int> <chr> 
    1     9     5 Female
    2     2     5 Female
    3     1     9 Male

Now there is only one response from a Male and two from Females, and on Q2 both Female respondents gave the same answer. If I rerun my t-test code on this subset, it fails:

df %>%
  filter(Answer_Date == First_Answer_Date) %>%
  select(all_of(questions), Sex) %>%
  filter(Sex != "No_Response") %>%
  gather(key = variable, value = value, -Sex) %>%
  group_by(Sex, variable) %>%
  summarize(value = list(value)) %>%
  spread(Sex, value) %>%
  group_by(variable) %>%
  mutate( p_Female = t.test(unlist(Female))$p.value,
          p_Male   = t.test(unlist(Male))$p.value,
          t_Female = t.test(unlist(Female))$statistic,
          t_Male = t.test(unlist(Male))$statistic) %>%
  mutate( Female = length(unlist(Female)),
          Male   = length(unlist(Male)))

Error: Problem with `mutate()` input `p_Female`.
x data are essentially constant
i Input `p_Female` is `t.test(unlist(Female))$p.value`.
i The error occurred in group 2: variable = "Q2".

The error message makes sense, but this is a situation I am likely to encounter in practice: some subsets will be of size 1 or 0, and on some questions all respondents may give the same answer. How can I make the code degrade gracefully, putting NA (or a blank) in those cells of the output tibble where no answer can be computed?

Sincerely

Thomas Philips

Solution

Perhaps you can use tryCatch to handle the error and return NA instead:

library(dplyr)
library(tidyr)

df %>%
  filter(Answer_Date == First_Answer_Date) %>%
  select(all_of(questions), Sex) %>%
  filter(Sex != "No_Response") %>%
  pivot_longer(cols = -Sex, names_to = "variable") %>%
  group_by(Sex, variable) %>%
  summarize(value = list(value)) %>%
  pivot_wider(names_from = Sex, values_from = value) %>%
  group_by(variable) %>%
  # tryCatch() turns any t.test() error (too few observations, constant data) into NA
  mutate( p_Female = tryCatch(t.test(unlist(Female))$p.value,   error = function(e) NA),
          p_Male   = tryCatch(t.test(unlist(Male))$p.value,     error = function(e) NA),
          t_Female = tryCatch(t.test(unlist(Female))$statistic, error = function(e) NA),
          t_Male   = tryCatch(t.test(unlist(Male))$statistic,   error = function(e) NA)) %>%
  ungroup() %>%
  mutate( Female = lengths(Female),
          Male   = lengths(Male))
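
The four tryCatch calls above repeat the same pattern and run each t-test twice. One possible way to tidy this up is a small wrapper that runs the test once and returns both numbers, falling back to NA when the test cannot be computed. This is only a sketch: the name safe_t_test is made up for illustration and is not part of dplyr, tidyr, or base R.

```r
# Hypothetical helper (an assumption, not from the original answer):
# run a one-sample t-test once and return the p-value and t statistic,
# or NA for both when t.test() signals an error.
safe_t_test <- function(x) {
  x <- unlist(x)  # accept a list-column cell as well as a plain vector
  tryCatch({
    res <- t.test(x)
    c(p = res$p.value, t = unname(res$statistic))
  },
  error = function(e) c(p = NA_real_, t = NA_real_))
}

safe_t_test(c(9, 2))      # two distinct values: both numbers are computed
safe_t_test(c(5, 5))      # constant data: both NA
safe_t_test(1)            # a single observation: both NA
safe_t_test(numeric(0))   # empty group: both NA
```

Inside the grouped mutate, something like p_Female = safe_t_test(Female)[["p"]] and t_Female = safe_t_test(Female)[["t"]] should then work, since each group holds a single list cell per column.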