Search code examples
rggplot2lapplysapply

How to combine multiple R functions using apply functions including ggplot2


I have a bunch of separate code chunks to run normality tests in R, and I would like to be able to combine them so that I can test specific variables without copying the code each time. So far, all the individual code chunks are working (using the iris dataset as an example):

library(datasets)
library(tidyverse)
library(skimr)

data(iris)
iris$Species <- NULL

# descriptive statistics and normality tests
skim(iris$Sepal.Length)
round(stat.desc(iris$Sepal.Length, basic = FALSE, norm = TRUE), digits = 3)

# histogram with normality curve
hist_sepal_length <- ggplot(iris, aes(Sepal.Length)) + 
  geom_histogram(aes(y = ..density..), bins = 10, colour = "black", fill = "white") +
  labs(x = "Sepal.Length", y = "Density") + 
  stat_function(fun = dnorm, args = list(mean = mean(iris$Sepal.Length), sd = sd(iris$Sepal.Length)), colour = "black", size = 1)
hist_sepal_length

# qqplot
qqplot_sepal_length <- qplot(sample = iris$Sepal.Length)
qqplot_sepal_length

I can do the first step of the descriptive statistics using sapply

round(sapply(iris, stat.desc, basic = FALSE, norm = TRUE), digits = 3)

However, I am not sure how to use any of the apply functions with ggplot2. I looked at the following questions:

How to use lapply with ggplot2 while indexing variables

using an apply function with ggplot2 to create bar plots for more than one variable in a data.frame

Using apply functions with ggplot to plot a subset of dataframe columns

Using lapply to make boxplots of a variable list

However, none of them quite cover what I want, since my ggplot also includes a stat_function which references the variable. I also would like the output in separate graphs. Is there a way to write the ggplot code so that it will run through all of the variables at once (so sepal length, sepal width, petal length, petal width)? I have the variables that I want to run the normality tests on already saved to a separate dataframe, so there's no need to subset.

Finally, is there a way that I could package the 3 steps together (normality tests, histogram, and qq plot) into one function?


Solution

  • The goal here is to try to replace whenever you have Sepal.Length for a generic variable. After that, you can make a function and call it for each variable. Then, it's simple to generalize a loop that will return all the results at once.

    library(datasets)
    library(tidyverse)
    library(skimr)
    library(pastecs)
    
    data(iris)
    #-- Function
    testVarNormality <- function(var, data) {
        # descriptive statistics and normality tests
        skim_res <- skim(data[,var])
        desc_stats <- round(stat.desc(data[,var], basic = FALSE, norm = TRUE), digits = 3)
    
        # histogram with normality curve
        hist <- ggplot(data, aes_string(var)) + 
            geom_histogram(aes(y = ..density..), bins = 10, colour = "black", fill = "white") +
            labs(x = var, y = "Density") + 
            stat_function(fun = dnorm, args = list(mean = mean(data[,var]), sd = sd(data[,var])), colour = "black", size = 1)
    
        # qqplot
        qqplot <- qplot(sample = data[,var])
    
        list(skim_res = skim_res, desc_stats = desc_stats, histogram = hist, qqplot = qqplot)
    }
    #-- 1 function call
    sepal_length_res <- testVarNormality("Sepal.Length", iris)
    sepal_length_res$histogram
    sepal_length_res$qqplot
    
    #-- Calling for all columns (except species)
    all_res <- lapply(colnames(iris)[1:4], testVarNormality, iris)
    names(all_res) <- colnames(iris)[1:4]
    #-- Get a result example
    all_res$Sepal.Width$histogram
    

    ----

    How to do it by species:

    irisBySpecies <- split(iris, iris$Species)
    #-- Nested list
    
        res_byGroup <- lapply(
            irisBySpecies,
            function(species_data) {
                res4species <- lapply(colnames(species_data)[1:4], testVarNormality, species_data)
                names(res4species) <- colnames(iris)[1:4]
                return(res4species)
            }
        )
        names(res_byGroup) <- names(irisBySpecies)
    

    Note that I had to make an anonymous function to do this. There are probably more elegant ways to do the code for the original function that would make it easier to apply per group, but this way is quite generalizable.