I have a data frame with responses to multiple questions (a reproducible example with two questions is below):
library(dplyr)
library(tidyr)
set.seed(1)
df <- data.frame(
UserId = c(rep("A", 4), rep("B", 4), rep("C", 4), rep("D", 4)),
Sex = c(rep("Female", 8), rep("Male", 4), rep("No_Response", 4)),
Answer_Date = as.Date(c("1990-01-01", "1990-02-01", "1990-03-01", "1990-04-01",
"1991-02-01", "1991-03-01", "1991-04-01", "1991-05-01",
"1992-03-01", "1992-04-01", "1992-05-01", "1992-06-01",
"1993-07-10", "1992-08-10", "1993-09-10", "1993-10-10")),
Q1 = sample(1:10, 16, replace = TRUE),
Q2 = sample(1:10, 16, replace = TRUE)
) %>%
group_by(UserId) %>%
mutate(First_Answer_Date = min(Answer_Date)) %>%
mutate(Last_Answer_Date = max(Answer_Date)) %>%
ungroup()
Following the suggestion in https://sebastiansauer.github.io/multiple-t-tests-with-dplyr/, I run one-sample t-tests for Q1 and Q2, separately for each sex, against the null hypothesis that the true mean is 0:
questions <- c("Q1", "Q2")
df %>%
select(all_of(questions), Sex) %>%
filter(Sex != "No_Response") %>%
gather(key = variable, value = value, -Sex) %>%
group_by(Sex, variable) %>%
summarize(value = list(value)) %>%
spread(Sex, value) %>%
group_by(variable) %>%
mutate( p_Female = t.test(unlist(Female))$p.value,
p_Male = t.test(unlist(Male) )$p.value,
t_Female = t.test(unlist(Female))$statistic,
t_Male = t.test(unlist(Male) )$statistic) %>%
mutate( Female = length(unlist(Female)),
Male = length(unlist(Male))
)
which gives me
# A tibble: 2 x 7
# Groups:   variable [2]
  variable Female  Male  p_Female  p_Male t_Female t_Male
  <chr>     <int> <int>     <dbl>   <dbl>    <dbl>  <dbl>
1 Q1            8     4 0.0000501 0.00137     8.78  11.6
2 Q2            8     4 0.00217   0.0115      4.71   5.55
All good so far. My troubles start when I want to run the t-tests only on each user's first response, i.e. the rows where Answer_Date equals First_Answer_Date:
df %>%
filter(Answer_Date == First_Answer_Date) %>%
select(all_of(questions), Sex) %>%
filter(Sex != "No_Response")
# A tibble: 3 x 3
     Q1    Q2 Sex
  <int> <int> <chr>
1     9     5 Female
2     2     5 Female
3     1     9 Male
Now there is only one response from a Male respondent and two from Female respondents, and on Q2 both Female respondents gave the same answer. If I rerun my t-test code, I get an error:
df %>%
filter(Answer_Date == First_Answer_Date) %>%
select(all_of(questions), Sex) %>%
filter(Sex != "No_Response") %>%
gather(key = variable, value = value, -Sex) %>%
group_by(Sex, variable) %>%
summarize(value = list(value)) %>%
spread(Sex, value) %>%
group_by(variable) %>%
mutate( p_Female = t.test(unlist(Female))$p.value,
p_Male = t.test(unlist(Male))$p.value,
t_Female = t.test(unlist(Female))$statistic,
t_Male = t.test(unlist(Male))$statistic) %>%
mutate( Female = length(unlist(Female)),
Male = length(unlist(Male)))
Error: Problem with `mutate()` input `p_Female`.
x data are essentially constant
i Input `p_Female` is `t.test(unlist(Female))$p.value`.
i The error occurred in group 2: variable = "Q2".
The error message is logical, but this is a situation I am likely to encounter in practice: some subsets will contain only one observation or none at all, and all respondents to a given question may give the same answer. How can I make the code degrade gracefully, putting NA (or a blank) in those cells of the output tibble where a result cannot be computed for one reason or another?
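To isolate the failure cases from the dplyr pipeline, here is a minimal sketch in base R. The helper name err_msg is hypothetical and exists only for this illustration; it captures the text of whatever error t.test raises, or returns NA if the test runs.

```r
# Hypothetical helper, for illustration only: return the error message
# raised by a one-sample t.test(), or NA_character_ if no error occurs.
err_msg <- function(x) {
  tryCatch({
    t.test(x)
    NA_character_
  }, error = function(e) conditionMessage(e))
}

msg_constant <- err_msg(c(5, 5))     # two identical answers (the Q2 / Female case)
msg_single   <- err_msg(9)           # a single observation (the Male case)
msg_empty    <- err_msg(numeric(0))  # an empty subset

print(c(msg_constant, msg_single, msg_empty))
```

All three calls fail inside t.test itself, so any fix has to wrap the t.test call rather than the surrounding mutate.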
Sincerely
Thomas Philips
Perhaps you can use tryCatch to handle the error, returning NA whenever t.test fails:
library(dplyr)
library(tidyr)
df %>%
filter(Answer_Date == First_Answer_Date) %>%
select(all_of(questions), Sex) %>%
filter(Sex != "No_Response") %>%
pivot_longer(cols = -Sex, names_to = "variable") %>%
group_by(Sex, variable) %>%
summarize(value = list(value)) %>%
pivot_wider(names_from = Sex, values_from = value) %>%
group_by(variable) %>%
# NA_real_ keeps the result columns numeric even when every test fails
mutate( p_Female = tryCatch(t.test(unlist(Female))$p.value, error = function(e) NA_real_),
p_Male = tryCatch(t.test(unlist(Male))$p.value, error = function(e) NA_real_),
t_Female = tryCatch(t.test(unlist(Female))$statistic, error = function(e) NA_real_),
t_Male = tryCatch(t.test(unlist(Male))$statistic, error = function(e) NA_real_)) %>%
ungroup() %>%
# lengths() counts the observations stored in each list cell
mutate( Female = lengths(Female),
Male = lengths(Male))
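If the repeated tryCatch calls feel noisy, one option is to factor them into a small helper. The sketch below is only a suggestion: the name safe_t and its mu argument are hypothetical, not part of any package. It runs t.test once per cell and returns both the p-value and the t statistic, or NA for both if the test cannot be computed.

```r
# Hypothetical helper (an assumption, not an existing function): run a
# one-sample t-test once and return the p-value and t statistic, or NA
# for both when t.test() signals an error.
safe_t <- function(x, mu = 0) {
  tryCatch({
    res <- t.test(x, mu = mu)
    c(p = res$p.value, t = unname(res$statistic))
  }, error = function(e) c(p = NA_real_, t = NA_real_))
}

res_ok       <- safe_t(c(9, 2))  # two distinct values: both entries are numbers
res_constant <- safe_t(c(5, 5))  # constant data: both entries are NA
res_single   <- safe_t(9)        # single observation: both entries are NA
```

Inside the grouped mutate above it could then be used as, for example, p_Female = safe_t(unlist(Female))[["p"]] and t_Female = safe_t(unlist(Female))[["t"]].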