I am trying to make the following plot: a histogram with bars for two groups.
My approach:
library(ggplot2)
library(dplyr)
graph2 |>
filter(!is.na(healthy)) |>
ggplot(aes(x = total_visits, fill = as.factor(healthy))) +
geom_histogram(aes(y = after_stat(count / sum(count))),
alpha = 0.6, color = "white", position = 'identity',
breaks = seq(0, 100, by = 1)) +
scale_x_continuous(breaks = seq(0, 100, 10)) +
scale_fill_manual(labels = c("TSCI", "SHS"), values = c("blue", "red")) +
labs(fill = "")
The dataset is quite large, so I am adding a sample of just 200 rows:
graph2 <- structure(list(total_visits_SHS = structure(c(4, 2, NA, NA, 2,
4, 6, 3, 3, 1, 12, NA, 3, NA, 12, 2, 2, 1, 2, NA, NA, 12, 3,
8, 3, NA, 1, 1, NA, 4, NA, 6, NA, NA, 2, 5, NA, NA, 15, 10, NA,
51, NA, 3, NA, 3, 1, 5, 6, 2, 8, 12, 50, 1, 4, 2, 2, 30, NA,
16, 2, 10, NA, 2, 5, 1, NA, 10, 3, NA, 24, 1, 7, 10, 5, NA, 10,
2, 1, 20, 1, NA, 1, 2, 1, NA, 3, 1, 2, 3, 1, 20, 6, 11, 4, 1,
4, 2, 5, 24, 8, 2, NA, NA, 2, 1, 12, 30, NA, NA, 10, NA, 3, 1,
4, 2, NA, 6, NA, 7, 50, 60, NA, 1, 1, 6, 7, NA, 4, 2, NA, 6,
NA, 3, 3, 4, 10, 1, 6, 5, NA, 10, 1, NA, 1, 1, NA, 3, 12, 40,
1, 3, 6, 4, 3, 1, 2, 24, NA, NA, NA, 10, 12, 2, 1, 2, 2, 1, 1,
3, 18, 1, 4, 8, 4, 15, 4, 2, NA, 3, 20, NA, NA, NA, 3, 4, 2,
2, 2, 2, 2, 1, 1, NA, NA, 16, 1, 1, 7, NA), label = "number of medical consultation (last 12 months)", format.stata = "%9.0g"),
healthy = structure(c(0, 1, 1, 1, 0, 1, 1, 0, 1, 1, NA, 0,
0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1,
1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1,
0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1,
1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0,
0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0,
1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1,
1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1,
1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1,
1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1), label = "Has no health condition", format.stata = "%9.0g"),
total_visits_SCI = structure(c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), label = "SwiSCI - Total healthcare vistis", format.stata = "%9.0g"),
total_visits = c(4, 2, 0, 0, 2, 4, 6, 3, 3, 1, 12, 0, 3,
0, 12, 2, 2, 1, 2, 0, 0, 12, 3, 8, 3, 0, 1, 1, 0, 4, 0, 6,
0, 0, 2, 5, 0, 0, 15, 10, 0, 51, 0, 3, 0, 3, 1, 5, 6, 2,
8, 12, 50, 1, 4, 2, 2, 30, 0, 16, 2, 10, 0, 2, 5, 1, 0, 10,
3, 0, 24, 1, 7, 10, 5, 0, 10, 2, 1, 20, 1, 0, 1, 2, 1, 0,
3, 1, 2, 3, 1, 20, 6, 11, 4, 1, 4, 2, 5, 24, 8, 2, 0, 0,
2, 1, 12, 30, 0, 0, 10, 0, 3, 1, 4, 2, 0, 6, 0, 7, 50, 60,
0, 1, 1, 6, 7, 0, 4, 2, 0, 6, 0, 3, 3, 4, 10, 1, 6, 5, 0,
10, 1, 0, 1, 1, 0, 3, 12, 40, 1, 3, 6, 4, 3, 1, 2, 24, 0,
0, 0, 10, 12, 2, 1, 2, 2, 1, 1, 3, 18, 1, 4, 8, 4, 15, 4,
2, 0, 3, 20, 0, 0, 0, 3, 4, 2, 2, 2, 2, 2, 1, 1, 0, 0, 16,
1, 1, 7, 0)), row.names = c(NA, -200L), label = "TEL17_CH", class = c("tbl_df",
"tbl", "data.frame"))
How can I make the values on the x-axis easier to see, especially the high ones? In the plot generated by ggplot, those bars are not visible. The plot I created was made with all the data.
I would suggest breaking the y-axis in situations like this, where the values differ by orders of magnitude and the low-frequency bars become hard to see. It works like zooming in on the low-frequency values in your dataset. One package that can break axes is ggbreak; its tutorial has more details. In your case I experimented with the y values and found that breaking the axis over the range (0.04, 0.1) worked best. The code is your original with scale_y_break() (plus theme_minimal()) added before the last statement:
library(tidyverse)
library(ggbreak)
graph2 |> filter(!is.na(healthy))|>
ggplot(aes(x=total_visits,fill=as.factor(healthy)))+
geom_histogram(aes(y = after_stat(count / sum(count))),
alpha=0.6,color="white", position = 'identity',
breaks = seq(0, 100, by = 1))+
scale_x_continuous(breaks = seq(0, 100, 10))+
scale_fill_manual(labels = c("TSCI", "SHS"), values = c("blue", "red"))+
scale_y_break(c(0.04 , 0.1), scales = .5) + theme_minimal() +
labs(fill="")
Here is the output. You can adjust the break points as you prefer, for example by breaking the y-axis between (0.04, 0.05) instead.
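One way to pick break points without trial and error is to look at the bar heights themselves. The following is only a rough sketch: `visits` is a made-up vector standing in for `graph2$total_visits`, and it ignores the split by `healthy`, so the heights are not the ones in the real plot. It uses base R's `hist()` with the same `breaks` as the histogram above to compute `count / sum(count)` per bin.

```r
# Made-up stand-in for graph2$total_visits (NOT the real data).
visits <- c(0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 4, 6, 10, 12, 24, 50)

# Same bin edges as geom_histogram(breaks = seq(0, 100, by = 1)).
h <- hist(visits, breaks = seq(0, 100, by = 1), plot = FALSE)

# Bar heights as proportions, mirroring after_stat(count / sum(count)).
heights <- h$counts / sum(h$counts)

# Non-empty bars, tallest first: a large gap between the first value
# and the rest is a candidate range for scale_y_break().
sort(heights[heights > 0], decreasing = TRUE)
```

If the tallest bar sits far above all the others, placing the break somewhere in that gap keeps every bar visible on one side of the break or the other.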
The package's scale_y_cut() function is another alternative. It splits the plot into sub-plots at the given value and has a space parameter for adjusting the gap between them:
graph2 |> filter(!is.na(healthy))|>
ggplot(aes(x=total_visits,fill=as.factor(healthy)))+
geom_histogram(aes(y = after_stat(count / sum(count))),
alpha=0.6,color="white", position = 'identity',
breaks = seq(0, 100, by = 1))+
scale_x_continuous(breaks = seq(0, 100, 10))+
scale_fill_manual(labels = c("TSCI", "SHS"), values = c("blue", "red"))+
scale_y_cut(c(0.04)) + theme_minimal() +
labs(fill="")
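The call above leaves the space parameter at its default. As a hedged sketch of how it could be set, the example below uses a small made-up data frame `toy` in place of `graph2` (the values are invented purely for illustration) and passes `space = 0.3`, an arbitrary value chosen here, to scale_y_cut().

```r
library(ggplot2)
library(ggbreak)

# Made-up stand-in for graph2 (NOT the real data): many zeros, a long tail.
toy <- data.frame(
  total_visits = c(rep(0, 20), 1:10, 12, 15, 20, 24, 30, 40, 50, 60),
  healthy      = rep(c(0, 1), length.out = 38)
)

p <- ggplot(toy, aes(x = total_visits, fill = as.factor(healthy))) +
  geom_histogram(aes(y = after_stat(count / sum(count))),
                 alpha = 0.6, color = "white", position = "identity",
                 breaks = seq(0, 100, by = 1)) +
  scale_x_continuous(breaks = seq(0, 100, 10)) +
  scale_fill_manual(labels = c("TSCI", "SHS"), values = c("blue", "red")) +
  # Cut the y-axis at 0.04; a larger space leaves a wider gap
  # between the two sub-plots.
  scale_y_cut(breaks = 0.04, space = 0.3) +
  theme_minimal() +
  labs(fill = "")

p
```

With the real data, the same two arguments can be dropped into the pipeline above; the cut point and the space value are both worth adjusting by eye.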