Tags: r, ggplot2, data-visualization, boxplot, fill

Descriptive analysis: different fill colours for the visualisation using lapply


I have a technical question about the following example dataset (I am using RStudio):

I wrote a function that produces visualisations for descriptive analysis. It still needs some work, but for now it looks like this (using boxplots as an example):

library(ggplot2)
library(dplyr)

data("Salaries", package = "carData")

f <- function(x) {
  lapply(X = Salaries %>% select_if(is.numeric), FUN = function(X) {
    ggplot(Salaries, aes(x, y = X, fill = x, color = x)) +
      geom_boxplot(col = "black")
  })
}

lapply(Salaries %>% select_if(is.factor), FUN = function(X) f(X))

With this I can draw boxplots for every combination of categorical and continuous variables.

However, I cannot find a way to give each boxplot a different fill colour. I would appreciate knowing how to apply fill colours both automatically and manually.

Thanks.


Solution

  • Based on the OP's comments on my first answer, which clarified what they are really after, here is a solution that combines my previous answer with the OP's request.

    Thus, this solution:

    • shows the variable labels in each plot (as the solution in my first answer already did; not requested, but good to have)
    • uses a different color set for the boxplots in each analyzed factor (requested)

    The solution is based on:

    1. Gathering relevant information about the factor variables, namely: how many there are, how many categories per factor variable, how many categories in total.
    2. Storing related information as part of the names of the factor variables in the data frame of factor variables (Salaries_factors).
    3. Defining a color palette with as many colors as the total number of categories across all factor variables.

    The implementation of f() uses this information to select the color set for each factor and to draw the plots.
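
The three steps above can be sketched in base R on a small toy data frame. This is only an illustration under stated assumptions: `toy_factors` and its columns are hypothetical, and the sketch keeps the offsets in named vectors instead of encoding them in the column names as the solution below does.

```r
# Toy data frame standing in for the factor columns of a real dataset (hypothetical).
toy_factors <- data.frame(
  size  = factor(c("S", "M", "L", "M")),
  shade = factor(c("dark", "light", "dark", "light"))
)

# Step 1: count the categories of each factor and the total across factors.
n_levels_by_factor <- sapply(toy_factors, nlevels)
n_levels_total <- sum(n_levels_by_factor)

# Step 2 (variant): first palette index of each factor, kept in a named vector.
start_by_factor <- setNames(
  c(1, 1 + cumsum(head(n_levels_by_factor, -1))),
  names(toy_factors)
)

# Step 3: one palette covering every category, then one slice per factor.
palette_all <- terrain.colors(n_levels_total)
colors_by_factor <- lapply(names(toy_factors), function(nm) {
  idx <- seq(start_by_factor[[nm]], length.out = n_levels_by_factor[[nm]])
  setNames(palette_all[idx], levels(toy_factors[[nm]]))
})
names(colors_by_factor) <- names(toy_factors)
```

Here `size` has three levels and `shade` has two, so `size` takes palette positions 1 to 3 and `shade` takes positions 4 and 5; no color is shared between the two factors' slices.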


    library(ggplot2)
    library(dplyr)
    
    f <- function(df, x_idx_name_depth, colors_palette) {
      # Get the relevant information about the x variable to plot
      # which will allow us to define the colors to use for the boxplots
      x_info = unlist( strsplit(x_idx_name_depth, ",") )
      idx_color_start = as.numeric(x_info[1])  # start position for the color set in the palette
      xname = x_info[2]
      n_colors = as.numeric(x_info[3])  # How many values the x variable takes
      
      # Get the values of the x variable
      x = df[[xname]]
      
      # Define the color set to use for the boxplots
      colors2use = setNames(colors_palette[idx_color_start:(idx_color_start+n_colors-1)],
                            names(table(x)))
    
      # Define all the continuous variables to visualize (one at a time)
      # with boxplots against the x variable
      toplot = df %>% select_if(is.numeric)
      lapply(
        names(toplot), FUN = function(yname) {
          y = toplot[[yname]]
          print(ggplot(mapping=aes(x, y, fill=x)) +
                  geom_boxplot(color = "black") + xlab(xname) + ylab(yname) +
                  scale_fill_manual(values=colors2use, aesthetics="fill"))
        }
      )
    }
    
    # Data for analysis
    data("Salaries", package = "carData")
    
    # Data containing the factor variables used to group the boxplots
    Salaries_factors = Salaries %>% select_if(is.factor)
    
    # Characteristics of the factor variables which will help us
    # define the color set in each boxplot group 
    factor_names = names(Salaries_factors)
    n_factors = length(factor_names)
    # Count levels (not observed values) so the count matches names(table(x)) used in f()
    n_categories_by_factor = sapply(Salaries_factors, nlevels)
    n_categories = sum(n_categories_by_factor)
    # head(..., -1) drops the last count and also works when there is only one factor
    color_start_index_by_factor = setNames( c(1, 1+cumsum(head(n_categories_by_factor, -1))),
                                            factor_names )
    
    # Set smart names to the factor variables so that we can infer the information needed to
    # define different (non-overlapping) color sets for the different boxplot groups.
    # These names allow us to infer:
    # - the order in which the factor variables are analyzed by the lapply() call
    #   --> this defines each color set.
    # - the number of different values each factor variable takes (categories)
    #   --> this defines each color within each color set
    # Ex: "4,discipline,2"
    names(Salaries_factors) = paste(color_start_index_by_factor,
                                    names(Salaries_factors),
                                    n_categories_by_factor,
                                    sep=",")
    
    # Define the colors palette to use
    colors_palette = terrain.colors(n=n_categories)
    
    # Run the process
    invisible(lapply(names(Salaries_factors),
                     FUN = function(factor_idx_name_depth)
                              f(Salaries, factor_idx_name_depth, colors_palette)))
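
To make the name encoding concrete, here is a minimal sketch of how one encoded name is decoded, mirroring the first lines of f(). The variable `encoded_name` and `palette_positions` are introduced only for illustration.

```r
# Encoded name in the form "<palette start>,<factor name>,<number of categories>".
encoded_name <- "4,discipline,2"

# Split on the literal comma and pick out the three fields.
parts <- strsplit(encoded_name, ",", fixed = TRUE)[[1]]
idx_color_start <- as.numeric(parts[1])
xname <- parts[2]
n_colors <- as.numeric(parts[3])

# Palette positions reserved for this factor.
palette_positions <- seq(idx_color_start, length.out = n_colors)
```

Note that this encoding relies on the comma as a separator, so a factor whose column name itself contains a comma would be split incorrectly.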
    

    Here are the generated boxplots for the salary variable against each of the three factor variables:

    [Images: salary by rank, salary by discipline, salary by sex]
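
For a single plot, the two options raised in the question (automatic and manual fill colours) might look like the sketch below. It assumes the ggplot2 and carData packages are installed. The three color names are arbitrary choices for illustration and are not taken from the solution above.

```r
library(ggplot2)

data("Salaries", package = "carData")

# Automatic: mapping a factor to `fill` lets ggplot2 apply its default discrete palette.
p_auto <- ggplot(Salaries, aes(x = rank, y = salary, fill = rank)) +
  geom_boxplot(color = "black")

# Manual: a named vector ties each factor level to a chosen color (arbitrary picks here).
rank_colors <- setNames(c("steelblue", "orange", "forestgreen"),
                        levels(Salaries$rank))
p_manual <- p_auto + scale_fill_manual(values = rank_colors)

print(p_auto)
print(p_manual)
```

Naming the color vector by factor level, as f() also does with setNames(), keeps each level tied to the same color regardless of the order in which the levels appear in the data.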