Error in using group_by and summarise in running correlation and test of significance with SPSS dataset

I borrow a dataset from SPSS prepared by Julie Pallant's SPSS Survival Manual and run it on R.

I select three columns to run correlation and significance test: toptim, tnegaff, sex. I select the columns using select: df <- survey %>% select(toptim, tnegaff, sex).

Then, problems emerge.

I'd like to know the correlation between toptim and tnegaff by sex. But I can't use cor and resort to correlate. Why is there error and any difference between the two methods?

df %>% group_by(sex) %>% summarise(cor = correlate(toptim, tnegaff)) <- OK (male = 0.22 female = 0.394)

df %>% group_by(sex) %>% summarise(cor = cor(toptim, tnegaff)) <- failed, returns with NA

I failed to obtain the test of significance with cor.test (The answer should be p = 0.0488)

Error in `summarise()`:
! Problem while computing `cor = cor.test(toptim, tnegaff)`.
✖ `cor` must be a vector, not a `htest` object.
ℹ The error occurred in group 1: sex = 1.

Then I try to follow past examples and use broom::tidy, but no output for p-values....

> df %>% group_by(sex) %>% broom::tidy(cor.test(toptim, tnegaff))
# A tibble: 3 × 13
  column      n  mean    sd median trimmed   mad   min   max range   skew kurtosis     se
  <chr>   <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>    <dbl>  <dbl>
1 toptim    435 22.1  4.43      22   22.3      3     7    30    23 NA        NA    0.212 
2 tnegaff   435 19.4  7.07      18   18.6      4    10    39    29 NA        NA    0.339 
3 sex       439  1.58 0.494      2    1.58     0     1     2     1 -0.318     1.10 0.0236

How can I get the result? May I know the reason for such failure?

Thank you for your answers in advance.

Solution

It's trying to use all values and coming across NAs I presume. If you set to use "complete.obs" then it should work. For the cor.test part wrap the output in a list function to use the tibble's capabilities to have a column of a vector of objects.

For the final tidying and getting p-values, use map(cor.test, broom::tidy) then tidyr::unnest() to get a full and tidy dataframe.

That's a few steps to go through but hope it helps!

df <- haven::read_sav("survey.sav")

library(tidyverse)

df %>% 
  group_by(sex) %>%
  summarise(cor = cor(toptim, tnegaff, use = "complete.obs"),
            cor.test = list(cor.test(toptim, tnegaff))) %>%
  mutate(tidy_out = map(cor.test, broom::tidy)) %>%
  unnest(tidy_out)
#> # A tibble: 2 × 11
#>   sex        cor cor.t…¹ estim…² stati…³  p.value param…⁴ conf.…⁵ conf.…⁶ method
#>   <dbl+l>  <dbl> <list>    <dbl>   <dbl>    <dbl>   <int>   <dbl>   <dbl> <chr> 
#> 1 1 [MAL… -0.220 <htest>  -0.220   -3.04 2.73e- 3     182  -0.353 -0.0775 Pears…
#> 2 2 [FEM… -0.394 <htest>  -0.394   -6.75 1.06e-10     248  -0.494 -0.284  Pears…
#> # … with 1 more variable: alternative <chr>, and abbreviated variable names
#> #   ¹cor.test, ²estimate, ³statistic, ⁴parameter, ⁵conf.low, ⁶conf.high

Edit - examining difference in correlation

Borrowing the function from here you can examine the difference in correlation coefficients between sexes like this:

cor.diff.test(df$toptim[df$sex == 1], df$tnegaff[df$sex == 1], df$toptim[df$sex == 2], df$tnegaff[df$sex == 2])