Search code examples
rcomparisonsummary

R - how to summarise data across pairs?


I have a dataset with the following structure:

study treatment severity people_with_pain sample_size
0001 paracetamol 5.0 12 24
0001 aspirin 7.0 10 20
0001 massage 10.2 20 21
0002 paracetamol 6.0 . .
0002 aspirin 6.0 . .
0003 massage 2.0 10 25
0003 paracetamol 3.5 10 25

I would like to evaluate severity and the proportion of people_with_pain within sample_size across pairs of different treatments (e.g. paracetamol-aspirin). When the outcome is continuous (severity), I am interested in the mean, standard deviation, median and IQR. When the outcome is dichotomous (people_with_pain/sample_size), I am interested in the average proportion.

For instance, for the paracetamol-aspirin comparison, severity would be 6.0 (average between study 0001, 6.0, and study 0002, 6.0. Study 0003 does not contribute because it compares massage and paracetamol). The average proportion of people_with_pain would be 0.5 (study 0001, 22/44. Study 0002 does not contribute as there is no data).

The desired output would as a data.frame per output (e.g. one data.frame for severity, one for proportion of people_with_pain) with comparisons as rows (e.g. paracetamol-aspirin):

comparison nstudies severity_mean severity_sd
paracetamol-aspirin 2 6.0 0
paracetamol-massage 2 4.375 2.30
aspirin-massage 0 NA NA

I have managed to build a permutation list of available comparison and exclude those duplicated (e.g. paracetamol-aspirin and aspirin-paracetamol), as well as those not meaningful (e.g. paracetamol-paracetamol), but I am stuck there.


Solution

  • Here is my solution using tidyverse. I am not completely sure that this is what you want because of some differences between my results and your expected results. However, to obtain all the combinations of treatment I've used a left_join of the dataframe with itself, followed by filter(treatment.x < treatment.y) which basically eliminates all the treatment pair duplicates.

    library(dplyr)
    library(tidyr)
    dat |>
      left_join(dat, by = "study") |>
      filter(treatment.x < treatment.y) |>
      unite("treatment", starts_with("treatment"), sep = "-") |>
      group_by(treatment) |>
      summarize({
        severity <- c(severity.x, severity.y)
        people_with_pain <- c(people_with_pain.x, people_with_pain.y)
        sample_size <- c(sample_size.x, sample_size.y)
        data.frame(severity_mean =     mean(severity),
                   severity_sd  =        sd(severity),
                   severity_median = median(severity),
                   severity_IQR =       IQR(severity),
                   people_with_pain = sum(people_with_pain)/sum(sample_size),
                   nstudies = length(unique(study)))
      })
    
    ##>             treatment severity_mean severity_sd severity_median severity_IQR
    ##> 1     aspirin-massage         8.600   2.2627417            8.60        1.600
    ##> 2 aspirin-paracetamol         6.000   0.8164966            6.00        0.500
    ##> 3 massage-paracetamol         5.175   3.5668614            4.25        3.175
    ##>   people_with_pain nstudies
    ##> 1        0.7317073        1
    ##> 2               NA        2
    ##> 3        0.5473684        2