Search code examples
rdplyrplyrconfidence-interval

ddply summarise with Confidence interval?


I am trying to summarise my data with ddply, and I am trying to find a way I could summarise the data while reflecting the reliability.

Here is a desciption of my data set.

  BSTN ASTN  BSEC ASTN1 BSTN2 ASTN2 BSTN3 ASTN3 BSTN4 ASTN4 BSTN5 TFtime Ttime    ID
1 1001 1003 69551  1703  1703     0     0     0     0     0     0    399  2933 35404
2 1001 1006 69664  1703  1703     0     0     0     0     0     0    399  2284 35405
3 1001 1701 66606  1703  1703     0     0     0     0     0     0    118  1750 35406
4 1001 1701 66600  1703  1703     0     0     0     0     0     0    118  1750 35406
5 1001 1701 66601  1703  1703     0     0     0     0     0     0    118  1750 35406
6 1001 1703 69434     0     0     0     0     0     0     0     0      0  1005 35407

What I would like as my output is a to summarise the values of Ttime and TFtime grouped by "ASTN"s and "BSTN"s.

For the mean values of "Ttime" and "TFtime" I would like to reflect the confidence interval 95%. So calculate the mean values of "Ttime" and "TFtime"s within the 95% boundary. How would I do this process with ddply if there are multiple combinations of BSTN~ASTNs.

below is the code I used and wish to revise.

Routetable<-ddply(A,c(.(BSTN),.(ASTN1),.(BSTN2),.(ASTN2),.(BSTN3),.(ASTN3),.(BSTN4),.(ASTN4),.(BSTN5),.(ASTN)), 
                    summarise, count=length(BSTN),mean=mean(Ttime),TFtimemean=mean(TFtime))

Solution

  • updated answer

    I'm not sure, but I guess what you actually want to do is filter all values that are larger / smaller than mean(x) -/+ 2*sd(x) and this by each group. The following approach would do that. In the case of ggplot2s Diamond data set it keeps about 97% of all values and just removes the extremes.

    library(tidyverse)
    
    diamonds %>% 
      group_by(cut, color) %>% 
      mutate(across(c(x,y,z),
                    list(low = ~ mean(.x, na.rm = TRUE) - 2 * sd(.x, na.rm = TRUE),
                         high = ~ mean(.x, na.rm = TRUE) + 2 * sd(.x, na.rm = TRUE))
                    )
             ) %>% 
      filter(x >= x_low & x <= x_high,
             y >= x_low & y <= y_high,
             z >= z_low & z <= z_high)
    #> # A tibble: 52,299 x 16
    #> # Groups:   cut, color [35]
    #>    carat cut   color clarity depth table price     x     y     z x_low x_high
    #>    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>  <dbl>
    #>  1 0.23  Ideal E     SI2      61.5    55   326  3.95  3.98  2.43  3.51   6.92
    #>  2 0.21  Prem~ E     SI1      59.8    61   326  3.89  3.84  2.31  3.52   7.65
    #>  3 0.290 Prem~ I     VS2      62.4    58   334  4.2   4.23  2.63  3.86   9.12
    #>  4 0.31  Good  J     SI2      63.3    58   335  4.34  4.35  2.75  4.14   8.62
    #>  5 0.24  Very~ I     VVS1     62.3    57   336  3.95  3.98  2.47  3.92   8.62
    #>  6 0.26  Very~ H     SI1      61.9    55   337  4.07  4.11  2.53  3.66   8.30
    #>  7 0.23  Very~ H     VS1      59.4    61   338  4     4.05  2.39  3.66   8.30
    #>  8 0.3   Good  J     SI1      64      55   339  4.25  4.28  2.73  4.14   8.62
    #>  9 0.23  Ideal J     VS1      62.8    56   340  3.93  3.9   2.46  3.88   8.76
    #> 10 0.31  Ideal J     SI2      62.2    54   344  4.35  4.37  2.71  3.88   8.76
    #> # ... with 52,289 more rows, and 4 more variables: y_low <dbl>, y_high <dbl>,
    #> #   z_low <dbl>, z_high <dbl>
    

    Created on 2020-06-23 by the reprex package (v0.3.0)

    old answer

    With better example data we could achieve a more programmatic approach. As example I use ggplot2s diamonds dataset. See my comments in the code below.

    library(tidyverse)
    
    diamonds %>% 
      # set up your groups
      nest_by(cut, color) %>%  
      # define in `across` for which variables you want means and conf int to be calculated 
      mutate(ttest = list(summarise(data, across(c(x,y,z), 
                                              ~ broom::tidy(t.test(.x))))),
             ttest = list(unpack(ttest, c(x, y, z), names_sep = "_") %>% 
       # select only the estimates and conf intervalls                          
                            select(contains("estimate"), contains("conf")))) %>% 
      unnest(ttest)
    #> # A tibble: 35 x 12
    #> # Groups:   cut, color [35]
    #>    cut   color      data x_estimate y_estimate z_estimate x_conf.low x_conf.high
    #>    <ord> <ord> <list<tb>      <dbl>      <dbl>      <dbl>      <dbl>       <dbl>
    #>  1 Fair  D     [163 × 8]       6.02       5.96       3.84       5.89        6.15
    #>  2 Fair  E     [224 × 8]       5.91       5.86       3.72       5.80        6.02
    #>  3 Fair  F     [312 × 8]       5.99       5.93       3.79       5.89        6.09
    #>  4 Fair  G     [314 × 8]       6.17       6.11       3.96       6.06        6.28
    #>  5 Fair  H     [303 × 8]       6.58       6.50       4.22       6.47        6.69
    #>  6 Fair  I     [175 × 8]       6.56       6.49       4.19       6.43        6.70
    #>  7 Fair  J     [119 × 8]       6.75       6.68       4.32       6.55        6.95
    #>  8 Good  D     [662 × 8]       5.62       5.63       3.50       5.55        5.69
    #>  9 Good  E     [933 × 8]       5.62       5.63       3.50       5.56        5.68
    #> 10 Good  F     [909 × 8]       5.69       5.71       3.54       5.63        5.76
    #> # … with 25 more rows, and 4 more variables: y_conf.low <dbl>,
    #> #   y_conf.high <dbl>, z_conf.low <dbl>, z_conf.high <dbl>
    

    Created on 2020-06-19 by the reprex package (v0.3.0)

    If you want to filter observations based on the confidence iIntervalls of the means you can adjust my approach above as follows. Note that this is not the same as filtering the top and bottom 2.5 % of each variable, you will loose a lot of data.

    library(tidyverse)
    
    diamonds %>% 
      nest_by(cut, color) %>% 
      mutate(ttest = summarise(data, across(c(x,y,z), 
                                                 ~ broom::tidy(t.test(.x)))) %>% 
             unpack(c(x,y,z), names_sep = "_")) %>% 
      unpack(ttest) %>% 
      select(cut, color, data, contains("estimate"), contains("conf")) %>% 
      rowwise(cut, color) %>% 
      mutate(data = list(filter(data,
                           x >= x_conf.low & x <= x_conf.high,
                           y >= x_conf.low & y <= y_conf.high,
                           z >= z_conf.low & z <= z_conf.high))) %>% 
      unnest(data)
    #> # A tibble: 322 x 19
    #> # Groups:   cut, color [30]
    #>    cut   color carat clarity depth table price     x     y     z x_estimate
    #>    <ord> <ord> <dbl> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>      <dbl>
    #>  1 Fair  D      0.91 SI2      62.5    66  3079  6.08  6.01  3.78       6.02
    #>  2 Fair  D      0.9  SI2      65.7    60  3205  5.98  5.93  3.91       6.02
    #>  3 Fair  D      0.9  SI2      64.7    59  3205  6.09  5.99  3.91       6.02
    #>  4 Fair  D      0.95 SI2      64.4    60  3384  6.06  6.02  3.89       6.02
    #>  5 Fair  D      0.9  SI2      64.9    57  3473  6.03  5.98  3.9        6.02
    #>  6 Fair  D      0.9  SI2      64.5    61  3473  6.1   6     3.9        6.02
    #>  7 Fair  D      0.9  SI1      64.5    61  3689  6.05  6.01  3.89       6.02
    #>  8 Fair  D      0.91 SI1      64.7    61  3730  6.06  5.99  3.9        6.02
    #>  9 Fair  D      0.9  SI2      64.6    59  3847  6.04  6.01  3.89       6.02
    #> 10 Fair  D      0.91 SI1      64.4    60  3855  6.08  6.04  3.9        6.02
    #> # ... with 312 more rows, and 8 more variables: y_estimate <dbl>,
    #> #   z_estimate <dbl>, x_conf.low <dbl>, x_conf.high <dbl>, y_conf.low <dbl>,
    #> #   y_conf.high <dbl>, z_conf.low <dbl>, z_conf.high <dbl>
    

    Created on 2020-06-22 by the reprex package (v0.3.0)