Search code examples
rstatisticsaggregateconfidence-intervalgroup

R calculate confidence interval for mean in aggregate, by function


I am intended to create a script that can generate a statistical information, grouped based on category (which in the example below; SDP & M346). I need the n, mean, standard deviation (sd), standard error (se) and confidence interval. I have the code ready for all except the confidence interval. I could not manage to embed the calculation of the confidence interval in group using aggregate function.

tab.sep.file

enter image description here

I was trying to print the mean/sd/se/confidence interval grouped by the column SDP and M346.

aggregate(tab.sep.file$Hic8, FUN=mean, by=list(SDP=tab.sep.file$SDP, trait=tab.sep.file$M346))

By switching the function to length, standard deviation and creating formula for standard error, I managed to obtain the statistics for grouped data in one go. The output for the script above is something as below: enter image description here

To calculate the confidence interval for the entire data (not grouping):

mean <- mean(tab.sep.file$Hic8)
n <- length(tab.sep.file$Hic8)
standard_deviation <- sd(tab.sep.file$Hic8)
standard_error <- standard_deviation / sqrt(n)
alpha = 0.05
degrees_of_freedom = n - 1
t_score = qt(p=alpha/2, df=degrees_of_freedom, lower.tail=F)
margin_error <- t_score * standard_error
 
# Calculating lower bound and upper bound
lower_bound <- mean_value - margin_error
upper_bound <- mean_value + margin_error
 
# Print the confidence interval
print(c(lower_bound,upper_bound))

embedded from [https://www.geeksforgeeks.org/how-to-find-confidence-intervals-in-r/][3] 

But how should I calculate the confidence interval based on the grouped data using the aggregate function in one go like the rest?

My main problem here is that instead of having one n, in grouped data, there will definitely having more than one n.

I have tried using dplyr tool too, but the output is incorrect, something seemed off.

> tab.sep.file %>%
+ group_by(SDP,M346) %>%
+ summarise(mean = mean(tab.sep.file$Hic8),
+ sd = sd(tab.sep.file$Hic8),
+ n = n()) %>%
+ mutate(se = sd / sqrt(n),
+ lower.ci = mean - qt(1 - (0.05 / 2 ), n -1) * se,
+ upper.ci = mean + qt(1 - (0.05 / 2 ), n -1) * se)

Solution

  • You need specify grouping in aggregate. e.g. via formula. Try this

    > aggf <- \(x) {
    +   n <- length(x)
    +   xmn <- mean(x)
    +   xsd <- sd(x)
    +   se <- xsd/sqrt(n)
    +   ci <- xmn + se*(-qt(p=.05/2, df=n - 1)) %*% cbind(-1, 1)
    +   c(mean=xmn, sd=xsd, se=se, ci=ci)
    + }
    > aggregate(Hic8 ~ sdp, dat, aggf)
       sdp   Hic8.mean     Hic8.sd     Hic8.se    Hic8.ci1    Hic8.ci2
    1 SDP1  0.64358665  0.20498872  0.08368630  0.42846418  0.85870913
    2 SDP2  0.32132755  0.16655345  0.09615968 -0.09241416  0.73506925
    3 SDP3  0.46302347  0.37685995  0.21758019 -0.47314854  1.39919549
    

    Data

    > dput(dat)
    structure(list(sdp = c("SDP1", "SDP1", "SDP1", "SDP2", "SDP2", 
    "SDP2", "SDP1", "SDP1", "SDP1", "SDP3", "SDP3", "SDP3"), Hic8 = c(0.854164426913485, 
    0.382015778450295, 0.799342160811648, 0.144684031140059, 0.475511683383957, 
    0.343786930665374, 0.629821318900213, 0.412683063885197, 0.783493172377348, 
    0.534960983321071, 0.798729537520558, 0.0553799034096301)), class = "data.frame", row.names = c(NA, 
    -12L))