I have a dataset with >1000 observations belonging to either group A or group B, and ~150 categorical and continuous variables. Small version below.
``` r
set.seed(16)
mydf <- data.frame(
  ID = 1:50,
  group = sample(c("A", "B"), 50, replace = TRUE),
  length = rnorm(n = 50, mean = 0, sd = 1),
  weight = runif(50, min = 0, max = 1),
  color = sample(c("red", "orange", "yellow", "green", "blue"), 50, replace = TRUE),
  size = sample(c("big", "small"), 50, replace = TRUE)
)
```
I would like to visually compare group A and group B across each of the variables. To start, I would like to make boxplot pairs showing A and B side by side for each continuous variable, and the same using bar plots for each categorical variable. I think ggplot2's facet_grid would be ideal for this, but I'm not sure how to specify the plot type according to data type, or how to do this without specifying each variable one by one.
Interested in ggplot2 help and any alternative exploration techniques.
Exploring our data is arguably the most interesting and intellectually challenging part of our research, so I encourage you to do some more reading into this topic.
Visualisation is of course important. @Parfait has suggested reshaping your data to long format, which makes plotting easier. Your mix of continuous and categorical data is a bit tricky. Beginners often try very hard to avoid reshaping their data, but there is no need to fret! On the contrary, you will find that most questions require a specific shape of your data, and in most cases you will not find a "one-size-fits-all" shape.
So the real challenge is how to shape your data before plotting. There are obviously many ways of doing this. Below is one way, which should help "automatically" reshape the continuous columns and the categorical columns separately. Comments are in the code.
As a side note, when loading your data into R, I'd try to avoid storing categorical data as factors, and convert to factors only when you need them. How to do this depends on how you load your data. If it is from a CSV, you can, for example, use read.csv('your.csv', stringsAsFactors = FALSE).
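To make the side note concrete, here is a minimal sketch in base R. The file written to `csv_path` and the columns in it are made up purely for illustration; substitute your own CSV.

``` r
# Hypothetical data: write a tiny CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    group = c("A", "B", "A"),
    size = c("big", "small", "big"),
    weight = c(0.2, 0.5, 0.9)
  ),
  csv_path,
  row.names = FALSE
)

# Read categorical columns as plain character vectors rather than factors
dat <- read.csv(csv_path, stringsAsFactors = FALSE)
sapply(dat, class)

# Convert to factor only where you need it, with the level order you want
dat$group <- factor(dat$group, levels = c("A", "B"))
levels(dat$group)
```

From R 4.0.0 onwards, `stringsAsFactors = FALSE` is already the default for `read.csv()` and `data.frame()`, so the argument mainly matters on older versions.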
``` r
library(tidyverse)

# Gather the numeric columns, dropping ID first because it is numeric too.
# (I'd recommend against numeric IDs!)
data_num <-
  mydf %>%
  select(-ID) %>%
  pivot_longer(cols = which(sapply(., is.numeric)), names_to = 'key', values_to = 'value')

# No need to use facets here
ggplot(data_num) +
  geom_boxplot(aes(key, value, color = group))

# Selecting categorical columns is a bit more tricky in this example,
# because your group is also categorical.
# One way:
# first convert all categorical columns to character,
# then turn your "group" into a factor,
# then gather the remaining character columns.
# I use simple count() and mutate() to create a summary data frame with the
# proportions, and geom_col(), which is equivalent to geom_bar(stat = 'identity').
# There may be neater ways, but this is pretty straightforward.
data_cat <-
  mydf %>%
  select(-ID) %>%
  mutate_if(.predicate = is.factor, .funs = as.character) %>%
  mutate(group = factor(group)) %>%
  pivot_longer(cols = which(sapply(., is.character)), names_to = 'key', values_to = 'value') %>%
  count(group, key, value) %>%
  group_by(group, key) %>%
  mutate(percent = n / sum(n)) %>%
  ungroup() # I always ungroup after my data manipulations, to avoid unexpected effects

ggplot(data_cat) +
  geom_col(aes(group, percent, fill = key)) +
  facet_grid(~ value)
```
Created on 2020-01-07 by the reprex package (v0.3.0)
Credit for how to gather conditionally goes to this answer from @H1.
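With ~150 variables on very different scales, one panel per variable may be easier to read than a shared axis. The following is only a sketch of that variation, not a replacement for the code above. It assumes a tidyverse recent enough to provide the `where()` selection helper, and the names `num_long`, `cat_long`, `p_num` and `p_cat` are my own placeholders.

``` r
library(tidyverse)

# Same toy data as in the question; strings are kept as character (an assumption of this sketch)
set.seed(16)
mydf <- data.frame(
  ID = 1:50,
  group = sample(c("A", "B"), 50, replace = TRUE),
  length = rnorm(n = 50, mean = 0, sd = 1),
  weight = runif(50, min = 0, max = 1),
  color = sample(c("red", "orange", "yellow", "green", "blue"), 50, replace = TRUE),
  size = sample(c("big", "small"), 50, replace = TRUE),
  stringsAsFactors = FALSE
)

# Continuous variables: long format, then one boxplot panel per variable
num_long <- mydf %>%
  select(-ID) %>%
  pivot_longer(cols = where(is.numeric), names_to = "key", values_to = "value")

p_num <- ggplot(num_long, aes(group, value, color = group)) +
  geom_boxplot() +
  facet_wrap(~ key, scales = "free_y") # each variable gets its own y scale

# Categorical variables: proportion of each level within every group and variable
cat_long <- mydf %>%
  select(where(is.character)) %>% # keeps group, color and size
  pivot_longer(cols = -group, names_to = "key", values_to = "value") %>%
  count(group, key, value) %>%
  group_by(group, key) %>%
  mutate(percent = n / sum(n)) %>%
  ungroup()

p_cat <- ggplot(cat_long, aes(value, percent, fill = group)) +
  geom_col(position = "dodge") + # A and B side by side for each level
  facet_wrap(~ key, scales = "free_x") # each variable shows only its own levels

p_num
p_cat
```

Here the facets are the variables and the fill is the group, so A and B sit next to each other within every level. Whether that reads better than the layout above is a matter of taste and of how many levels your categorical variables have.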