I am intended to create a script that can generate a statistical information, grouped based on category (which in the example below; SDP & M346). I need the n, mean, standard deviation (sd), standard error (se) and confidence interval. I have the code ready for all except the confidence interval. I could not manage to embed the calculation of the confidence interval in group using aggregate
function.
tab.sep.file
I was trying to print the mean/sd/se/confidence interval grouped by the column SDP
and M346
.
aggregate(tab.sep.file$Hic8, FUN=mean, by=list(SDP=tab.sep.file$SDP, trait=tab.sep.file$M346))
By switching the function to length, standard deviation and creating formula for standard error, I managed to obtain the statistics for grouped data in one go. The output for the script above is something as below:
To calculate the confidence interval for the entire data (not grouping):
mean <- mean(tab.sep.file$Hic8)
n <- length(tab.sep.file$Hic8)
standard_deviation <- sd(tab.sep.file$Hic8)
standard_error <- standard_deviation / sqrt(n)
alpha = 0.05
degrees_of_freedom = n - 1
t_score = qt(p=alpha/2, df=degrees_of_freedom, lower.tail=F)
margin_error <- t_score * standard_error
# Calculating lower bound and upper bound
lower_bound <- mean_value - margin_error
upper_bound <- mean_value + margin_error
# Print the confidence interval
print(c(lower_bound,upper_bound))
embedded from [https://www.geeksforgeeks.org/how-to-find-confidence-intervals-in-r/][3]
But how should I calculate the confidence interval based on the grouped data using the aggregate
function in one go like the rest?
My main problem here is that instead of having one n
, in grouped data, there will definitely having more than one n
.
I have tried using dplyr
tool too, but the output is incorrect, something seemed off.
> tab.sep.file %>%
+ group_by(SDP,M346) %>%
+ summarise(mean = mean(tab.sep.file$Hic8),
+ sd = sd(tab.sep.file$Hic8),
+ n = n()) %>%
+ mutate(se = sd / sqrt(n),
+ lower.ci = mean - qt(1 - (0.05 / 2 ), n -1) * se,
+ upper.ci = mean + qt(1 - (0.05 / 2 ), n -1) * se)
You need specify grouping in aggregate
. e.g. via formula. Try this
> aggf <- \(x) {
+ n <- length(x)
+ xmn <- mean(x)
+ xsd <- sd(x)
+ se <- xsd/sqrt(n)
+ ci <- xmn + se*(-qt(p=.05/2, df=n - 1)) %*% cbind(-1, 1)
+ c(mean=xmn, sd=xsd, se=se, ci=ci)
+ }
> aggregate(Hic8 ~ sdp, dat, aggf)
sdp Hic8.mean Hic8.sd Hic8.se Hic8.ci1 Hic8.ci2
1 SDP1 0.64358665 0.20498872 0.08368630 0.42846418 0.85870913
2 SDP2 0.32132755 0.16655345 0.09615968 -0.09241416 0.73506925
3 SDP3 0.46302347 0.37685995 0.21758019 -0.47314854 1.39919549
Data
> dput(dat)
structure(list(sdp = c("SDP1", "SDP1", "SDP1", "SDP2", "SDP2",
"SDP2", "SDP1", "SDP1", "SDP1", "SDP3", "SDP3", "SDP3"), Hic8 = c(0.854164426913485,
0.382015778450295, 0.799342160811648, 0.144684031140059, 0.475511683383957,
0.343786930665374, 0.629821318900213, 0.412683063885197, 0.783493172377348,
0.534960983321071, 0.798729537520558, 0.0553799034096301)), class = "data.frame", row.names = c(NA,
-12L))