I wanted to create a custom function that calculates the confidence interval of a column and returns it as two columns, lower.bound and upper.bound. I also wanted the function to work inside dplyr::summarize().
The function works as expected in every case I tested, except when the column is named "x". In that case it throws a warning and returns NaN values; it only works if the column is passed explicitly as .$x. Here is an example. I don't understand the nuance... could you point me in the right direction?
library(dplyr)
set.seed(12)
# creates random data frame
z <- data.frame(
  x = runif(100),
  y = runif(100),
  z = runif(100)
)
# creates function to calculate a t-based confidence interval for the mean
conf.int <- function(x, alpha = 0.05) {
  sample.mean <- mean(x)
  sample.n <- length(x)
  sample.sd <- sd(x)
  sample.se <- sample.sd / sqrt(sample.n)
  # upper-tail critical value of the t distribution
  t.score <- qt(p = alpha / 2,
                df = sample.n - 1,
                lower.tail = FALSE)
  margin.error <- t.score * sample.se
  lower.bound <- sample.mean - margin.error
  upper.bound <- sample.mean + margin.error
  # one-row data frame, which summarise() unpacks into two columns
  data.frame(lower.bound, upper.bound)
}
# This works as expected
z %>%
summarise(x = mean(y), conf.int(y))
# This does not
z %>%
summarise(x = mean(x), conf.int(x))
# This does
z %>%
summarise(x = mean(x), conf.int(.$x))
Thanks!
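This is a "feature" of dplyr: summarise() evaluates its arguments in order, and each newly created column is immediately visible to the arguments that follow. After x = mean(x), the name x refers to the single mean value rather than the original column, so conf.int(x) receives a vector of length 1. sd() of one value is NA and qt() is called with df = 0, which is where the warning and the NaN values come from. .$x works because . is the original data frame z supplied by the pipe, which is unaffected by what summarise() has computed so far.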
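To see that later arguments pick up the overwritten column, here is a minimal sketch on a toy data frame. The names d, after and before are made up for illustration and do not come from the question.

```r
library(dplyr)

# Toy data, assumed for illustration only
d <- data.frame(x = c(1, 2, 3, 4))

# x is overwritten first, so length(x) sees the one-value mean
after <- d %>% summarise(x = mean(x), n = length(x))

# length(x) runs first and still sees the original four values
before <- d %>% summarise(n = length(x), x = mean(x))

after$n   # 1
before$n  # 4

# The standard deviation of a single value is undefined
sd(after$x)  # NA
```

The same thing happens inside conf.int(x): it is handed one number instead of 100.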
Two possible options are to store the mean under a different name, or to call conf.int() before x is overwritten -
library(dplyr)
z %>% summarise(x1 = mean(x), conf.int(x))
# x1 lower.bound upper.bound
#1 0.4797154 0.4248486 0.5345822
z %>% summarise(conf.int(x), x = mean(x))
# lower.bound upper.bound x
#1 0.4248486 0.5345822 0.4797154
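A third way to sidestep the name clash, if you are free to change the helper, is to have it return the mean alongside the bounds, so summarise() never needs a separate x = mean(x) argument. This is only a sketch: mean_ci and its output column name mean are hypothetical and not part of the original code, and it assumes a dplyr version (1.0.0 or later) that unpacks an unnamed data frame result into columns.

```r
library(dplyr)

# Hypothetical helper (not from the question): returns the mean together
# with its t-based confidence interval as a one-row data frame
mean_ci <- function(v, alpha = 0.05) {
  n <- length(v)
  m <- mean(v)
  margin <- qt(alpha / 2, df = n - 1, lower.tail = FALSE) * sd(v) / sqrt(n)
  data.frame(mean = m, lower.bound = m - margin, upper.bound = m + margin)
}

# Same data as in the question
set.seed(12)
z <- data.frame(x = runif(100), y = runif(100), z = runif(100))

# No column is overwritten before the helper sees it
res <- z %>% summarise(mean_ci(x))
res
```

Because the mean is computed inside the function from the original column, the order of arguments in summarise() no longer matters.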