I have a technical question regarding this example dataset (using RStudio):
I created a function for descriptive-analysis visualisation (it still needs some work). For now it looks like this, using boxplots as an example:
library(ggplot2)
library(dplyr)
data("Salaries", package = "carData")
f <- function(x) {
  lapply(X = Salaries %>% select_if(is.numeric), FUN = function(X) {
    ggplot(Salaries, aes(x, y = X, fill = x, color = x)) +
      geom_boxplot(col = "black")
  })
}
lapply(Salaries %>% select_if(is.factor), FUN = function(X) f(X))
So now I am able to visualise boxplots for every combination of a categorical and a continuous variable.
However, I cannot find a way to give each boxplot a different fill colour. (I would like to know how to apply fill colours both automatically and manually.)
Thanks.
Based on the OP's comments to my first answer, stating what they are really after, I now give a solution that integrates my previous answer with the OP's wishes.
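Thus, this solution draws boxplots of every continuous variable against every factor variable, using a different (non-overlapping) set of fill colors for each factor variable.
The solution is based on storing, in the names of the columns of the data frame that holds the factor variables (Salaries_factors), the information needed to define each color set: the start position of the color set in the palette, the name of the factor variable, and its number of categories. The implementation of f() leverages this information and does the rest.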
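To make the naming trick concrete, here is a minimal base R sketch of how the start index, the variable name and the number of categories can be packed into one string and unpacked again. The data frame toy and its columns are made up for illustration and are not part of the Salaries data; nlevels() and head() are used here for brevity, whereas the code further down uses length(unique(x)) and explicit indexing.

```r
# Toy factors (hypothetical, for illustration only)
toy <- data.frame(
  size  = factor(c("S", "M", "L", "M")),
  shape = factor(c("round", "square", "round", "round"))
)

# Number of categories per factor, and the position in the palette
# where each factor's color set starts
depth <- vapply(toy, nlevels, integer(1))
start <- cumsum(c(1, head(depth, -1)))

# Pack "start,name,depth" into a single string per factor
encoded <- paste(start, names(toy), depth, sep = ",")
print(encoded)

# Unpack the string of the second factor, the way f() does with strsplit()
info <- strsplit(encoded[2], ",")[[1]]
idx_color_start <- as.numeric(info[1])  # start position in the palette
xname <- info[2]                        # name of the factor variable
n_colors <- as.numeric(info[3])         # number of categories
```

With these values, the second factor would use palette entries idx_color_start to idx_color_start + n_colors - 1, which do not overlap with the entries used by the first factor.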
library(ggplot2)
library(dplyr)
f <- function(df, x_idx_name_depth, colors_palette) {
  # Get the relevant information about the x variable to plot
  # which will allow us to define the colors to use for the boxplots
  x_info = unlist( strsplit(x_idx_name_depth, ",") )
  idx_color_start = as.numeric(x_info[1])  # start position for the color set in the palette
  xname = x_info[2]
  n_colors = as.numeric(x_info[3])         # How many values the x variable takes
  # Get the values of the x variable
  x = df[[xname]]
  # Define the color set to use for the boxplots
  colors2use = setNames(colors_palette[idx_color_start:(idx_color_start+n_colors-1)],
                        names(table(x)))
  # Define all the continuous variables to visualize (one at a time)
  # with boxplots against the x variable
  toplot = df %>% select_if(is.numeric)
  lapply(
    names(toplot), FUN = function(yname) {
      y = toplot[[yname]]
      print(ggplot(mapping=aes(x, y, fill=x)) +
              geom_boxplot(color = "black") + xlab(xname) + ylab(yname) +
              scale_fill_manual(values=colors2use, aesthetics="fill"))
    }
  )
}
# Data for analysis
data("Salaries", package = "carData")
# Data containing the factor variables used to group the boxplots
Salaries_factors = Salaries %>% select_if(is.factor)
# Characteristics of the factor variables which will help us
# define the color set in each boxplot group
factor_names = names(Salaries_factors)
n_factors = length(factor_names)
n_categories_by_factor = unlist(lapply(Salaries_factors, FUN=function(x) length(unique(x))))
n_categories = sum(n_categories_by_factor)
color_start_index_by_factor = setNames( c(1, 1+cumsum(n_categories_by_factor[1:(n_factors-1)])),
                                        factor_names )
# Set smart names to the factor variables so that we can infer the information needed to
# define different (non-overlapping) color sets for the different boxplot groups.
# These names allow us to infer:
# - the order in which the factor variables are analyzed by the lapply() call
# --> this defines each color set.
# - the number of different values each factor variable takes (categories)
# --> this defines each color within each color set
# Ex: "4,discipline,2"
names(Salaries_factors) = paste(color_start_index_by_factor,
                                names(Salaries_factors),
                                n_categories_by_factor,
                                sep=",")
# Define the colors palette to use
colors_palette = terrain.colors(n=n_categories)
# Run the process
invisible(lapply(names(Salaries_factors),
                 FUN = function(factor_idx_name_depth)
                   f(Salaries, factor_idx_name_depth, colors_palette)))
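The question also asks how to set the colors by hand. One possible way, sketched here in base R on the assumption that the rest of the code stays unchanged, is to replace the terrain.colors() palette with a hand-picked vector that has one entry per category (7 for Salaries: 3 ranks, 2 disciplines, 2 sexes). The hex codes are arbitrary example values, and the slice shown is the one f() would take for the name "4,discipline,2".

```r
# Hand-picked palette: one color per category across all factor variables
# (arbitrary example colors; any vector of valid R colors of this length should work)
colors_palette <- c("#1b9e77", "#d95f02", "#7570b3",  # rank: 3 levels
                    "#e7298a", "#66a61e",              # discipline: 2 levels
                    "#e6ab02", "#a6761d")              # sex: 2 levels

# The slice of the palette used for "4,discipline,2"
idx_color_start <- 4
n_colors <- 2
colors2use <- setNames(
  colors_palette[idx_color_start:(idx_color_start + n_colors - 1)],
  c("A", "B")  # the two levels of discipline
)
print(colors2use)
```

For fully automatic colors, dropping the scale_fill_manual() layer from f() should leave ggplot2's default discrete fill scale in place, at the cost of the same default colors being reused across the factor variables.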
Here I show the generated boxplots for the salary variable in terms of the three factor variables: