Search code examples
rtidyversecorrelationsummarize

Error in using group_by and summarise in running correlation and test of significance with SPSS dataset


I borrow a dataset from SPSS prepared by Julie Pallant's SPSS Survival Manual and run it on R.

I select three columns to run correlation and significance test: toptim, tnegaff, sex. I select the columns using select: df <- survey %>% select(toptim, tnegaff, sex).

Then, problems emerge.

  1. I'd like to know the correlation between toptim and tnegaff by sex. But I can't use cor and resort to correlate. Why is there error and any difference between the two methods?

df %>% group_by(sex) %>% summarise(cor = correlate(toptim, tnegaff)) <- OK (male = 0.22 female = 0.394)

df %>% group_by(sex) %>% summarise(cor = cor(toptim, tnegaff)) <- failed, returns with NA

  1. I failed to obtain the test of significance with cor.test (The answer should be p = 0.0488)
Error in `summarise()`:
! Problem while computing `cor = cor.test(toptim, tnegaff)`.
✖ `cor` must be a vector, not a `htest` object.
ℹ The error occurred in group 1: sex = 1.

Then I try to follow past examples and use broom::tidy, but no output for p-values....

> df %>% group_by(sex) %>% broom::tidy(cor.test(toptim, tnegaff))
# A tibble: 3 × 13
  column      n  mean    sd median trimmed   mad   min   max range   skew kurtosis     se
  <chr>   <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>    <dbl>  <dbl>
1 toptim    435 22.1  4.43      22   22.3      3     7    30    23 NA        NA    0.212 
2 tnegaff   435 19.4  7.07      18   18.6      4    10    39    29 NA        NA    0.339 
3 sex       439  1.58 0.494      2    1.58     0     1     2     1 -0.318     1.10 0.0236

How can I get the result? May I know the reason for such failure?

Thank you for your answers in advance.


Solution

  • It's trying to use all values and coming across NAs I presume. If you set to use "complete.obs" then it should work. For the cor.test part wrap the output in a list function to use the tibble's capabilities to have a column of a vector of objects.

    For the final tidying and getting p-values, use map(cor.test, broom::tidy) then tidyr::unnest() to get a full and tidy dataframe.

    That's a few steps to go through but hope it helps!

    df <- haven::read_sav("survey.sav")
    
    library(tidyverse)
    
    df %>% 
      group_by(sex) %>%
      summarise(cor = cor(toptim, tnegaff, use = "complete.obs"),
                cor.test = list(cor.test(toptim, tnegaff))) %>%
      mutate(tidy_out = map(cor.test, broom::tidy)) %>%
      unnest(tidy_out)
    #> # A tibble: 2 × 11
    #>   sex        cor cor.t…¹ estim…² stati…³  p.value param…⁴ conf.…⁵ conf.…⁶ method
    #>   <dbl+l>  <dbl> <list>    <dbl>   <dbl>    <dbl>   <int>   <dbl>   <dbl> <chr> 
    #> 1 1 [MAL… -0.220 <htest>  -0.220   -3.04 2.73e- 3     182  -0.353 -0.0775 Pears…
    #> 2 2 [FEM… -0.394 <htest>  -0.394   -6.75 1.06e-10     248  -0.494 -0.284  Pears…
    #> # … with 1 more variable: alternative <chr>, and abbreviated variable names
    #> #   ¹​cor.test, ²​estimate, ³​statistic, ⁴​parameter, ⁵​conf.low, ⁶​conf.high
    

    Edit - examining difference in correlation

    Borrowing the function from here you can examine the difference in correlation coefficients between sexes like this:

    cor.diff.test(df$toptim[df$sex == 1], df$tnegaff[df$sex == 1], df$toptim[df$sex == 2], df$tnegaff[df$sex == 2])