I have a bunch of separate code chunks to run normality tests in R, and I would like to be able to combine them so that I can test specific variables without copying the code each time. So far, all the individual code chunks are working (using the iris dataset as an example):
library(datasets)
library(tidyverse)
library(skimr)
data(iris)
iris$Species <- NULL
# descriptive statistics and normality tests
skim(iris$Sepal.Length)
round(stat.desc(iris$Sepal.Length, basic = FALSE, norm = TRUE), digits = 3)
# histogram with normality curve
hist_sepal_length <- ggplot(iris, aes(Sepal.Length)) +
geom_histogram(aes(y = ..density..), bins = 10, colour = "black", fill = "white") +
labs(x = "Sepal.Length", y = "Density") +
stat_function(fun = dnorm, args = list(mean = mean(iris$Sepal.Length), sd = sd(iris$Sepal.Length)), colour = "black", size = 1)
hist_sepal_length
# qqplot
qqplot_sepal_length <- qplot(sample = iris$Sepal.Length)
qqplot_sepal_length
I can do the first step of the descriptive statistics using sapply
round(sapply(iris, stat.desc, basic = FALSE, norm = TRUE), digits = 3)
However, I am not sure how to use any of the apply functions with ggplot2. I looked at the following questions:
How to use lapply with ggplot2 while indexing variables
using an apply function with ggplot2 to create bar plots for more than one variable in a data.frame
Using apply functions with ggplot to plot a subset of dataframe columns
Using lapply to make boxplots of a variable list
However, none of them quite cover what I want, since my ggplot also includes a stat_function which references the variable. I also would like the output in separate graphs. Is there a way to write the ggplot code so that it will run through all of the variables at once (so sepal length, sepal width, petal length, petal width)? I have the variables that I want to run the normality tests on already saved to a separate dataframe, so there's no need to subset.
Finally, is there a way that I could package the 3 steps together (normality tests, histogram, and qq plot) into one function?
The goal here is to try to replace whenever you have Sepal.Length
for a generic variable. After that, you can make a function and call it for each variable. Then, it's simple to generalize a loop that will return all the results at once.
library(datasets)
library(tidyverse)
library(skimr)
library(pastecs)
data(iris)
#-- Function
testVarNormality <- function(var, data) {
# descriptive statistics and normality tests
skim_res <- skim(data[,var])
desc_stats <- round(stat.desc(data[,var], basic = FALSE, norm = TRUE), digits = 3)
# histogram with normality curve
hist <- ggplot(data, aes_string(var)) +
geom_histogram(aes(y = ..density..), bins = 10, colour = "black", fill = "white") +
labs(x = var, y = "Density") +
stat_function(fun = dnorm, args = list(mean = mean(data[,var]), sd = sd(data[,var])), colour = "black", size = 1)
# qqplot
qqplot <- qplot(sample = data[,var])
list(skim_res = skim_res, desc_stats = desc_stats, histogram = hist, qqplot = qqplot)
}
#-- 1 function call
sepal_length_res <- testVarNormality("Sepal.Length", iris)
sepal_length_res$histogram
sepal_length_res$qqplot
#-- Calling for all columns (except species)
all_res <- lapply(colnames(iris)[1:4], testVarNormality, iris)
names(all_res) <- colnames(iris)[1:4]
#-- Get a result example
all_res$Sepal.Width$histogram
How to do it by species:
irisBySpecies <- split(iris, iris$Species)
#-- Nested list
res_byGroup <- lapply(
irisBySpecies,
function(species_data) {
res4species <- lapply(colnames(species_data)[1:4], testVarNormality, species_data)
names(res4species) <- colnames(iris)[1:4]
return(res4species)
}
)
names(res_byGroup) <- names(irisBySpecies)
Note that I had to make an anonymous function to do this. There are probably more elegant ways to do the code for the original function that would make it easier to apply per group, but this way is quite generalizable.