In the context of cluster profiling, I am trying to visualize the distribution of categorical variables in each cluster compared to the overall population.
In order to make them comparable, I use the relative frequency.
For numerical variables this is pretty straightforward, because I can easily overlay histograms.
For categorical variables, instead, I would like to obtain something like this:
In which the external pie chart visualizes the relative frequency of Cluster 1 and the internal pie chart represents the relative frequency of the overall population.
A reproducible example is:
mydf <- data.frame(week_day = as.factor(c(rep("monday",10), rep("monday",5), rep("tuesday",5))), cluster = c(rep(1,10), rep(2,10)))
Here, Cluster 1 is composed exclusively of "monday", whereas the overall population is composed of 75% "monday" and 25% "tuesday".
The relative frequency within ggplot aes can be easily computed using:
y = (..count..)/sum(..count..)
Let's assume you are looking at a variable with 4 categories (A, B, C, D), and you have this sort of data frame:
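library(tidyverse) # provides tribble(), gather(), %>% and ggplot()

d <- tribble(~Category, ~Overall, ~Cluster1,
             "A", 250, 20,
             "B", 250, 110,
             "C", 250, 30,
             "D", 250, 40) %>%
  # reshape to long format: one row per Category/Cluster pair
  gather(Overall, Cluster1, key = "Cluster", value = "Count")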
which would mean: "over the whole dataset, 250 points have category A, 250 have category B, etc., and in Cluster1, 20 points have category A, 110 have category B, etc."
ggplot treats a pie chart as a stacked (scaled) bar chart plotted in polar coordinates.
To get a bar chart with relative frequencies, specify a position = "fill" argument in geom_bar:
ggplot(data = d) +
  geom_bar(stat = "identity",
           position = "fill", # automatically scales the bars from 0 to 1, necessary for polar coordinates
           aes(x = Cluster, y = Count, fill = Category))
which gives you the following chart (a bar chart with relative frequencies).
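To check what position = "fill" is drawing, the same relative frequencies can be computed by hand. This is only a sketch in base R: the data frame is rebuilt here with data.frame() so the snippet runs on its own, and the RelFreq column name is my own choice, not something from the plot code above.

```r
# Same counts as in the example, already in long format
d <- data.frame(
  Category = rep(c("A", "B", "C", "D"), times = 2),
  Cluster  = rep(c("Overall", "Cluster1"), each = 4),
  Count    = c(250, 250, 250, 250, 20, 110, 30, 40)
)

# Divide each count by the total of its own group (Overall or Cluster1)
d$RelFreq <- d$Count / ave(d$Count, d$Cluster, FUN = sum)

print(d)
```

Within each group the RelFreq values sum to 1, which is why both bars (and later both rings) have the same length even though the cluster is much smaller than the overall population.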
Then, you need to switch to polar coordinates and specify the y-axis as the angular parameter. The radial parameter will then be your clusters/overall distribution.
You should pay attention to the order of the factor levels, so that you get the right thing (here: the overall distribution) in the middle of the circles. My solution for the example is not meant to be optimal:
d$Cluster <- factor(d$Cluster, levels = c("Overall", "Cluster1"))
# `Overall` has the lowest factor index, so it is displayed as the innermost ring
And then, add the coord_polar layer:
ggplot(data = d) +
  geom_bar(stat = "identity",
           position = "fill", # automatically scales the bars from 0 to 1, necessary for polar coordinates
           aes(x = Cluster, y = Count, fill = Category),
           width = .9) + # play with the width of the bars to set the blank space between the circles; 1 = no blank space
  coord_polar(theta = "y") + # the y coordinate becomes the angular parameter
  theme(axis.text.y = element_blank()) # I didn't look for a fancy way to display radial labels
Which gives you two concentric rings, with the overall distribution in the middle and Cluster1 around it.
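To apply the same recipe to the mydf example from the question, the counts first have to be brought into the long Category/Cluster/Count layout. The following is a sketch only: the intermediate names (overall, cluster1, d2, p) are mine, base R's table() is used for the counting, and geom_col() stands in for geom_bar(stat = "identity"), which is equivalent.

```r
library(ggplot2)

# Reproducible data from the question
mydf <- data.frame(
  week_day = as.factor(c(rep("monday", 10), rep("monday", 5), rep("tuesday", 5))),
  cluster  = c(rep(1, 10), rep(2, 10))
)

# Count each category over the whole data set and within cluster 1.
# week_day is a factor, so "tuesday" is kept for cluster 1 with a count of 0.
overall  <- as.data.frame(table(Category = mydf$week_day))
cluster1 <- as.data.frame(table(Category = mydf$week_day[mydf$cluster == 1]))
overall$Cluster  <- "Overall"
cluster1$Cluster <- "Cluster1"

# Stack both tables and rename the count column produced by table()
d2 <- rbind(overall, cluster1)
names(d2)[names(d2) == "Freq"] <- "Count"

# "Overall" first, so that it ends up as the inner ring
d2$Cluster <- factor(d2$Cluster, levels = c("Overall", "Cluster1"))

p <- ggplot(d2) +
  geom_col(aes(x = Cluster, y = Count, fill = Category),
           position = "fill", width = 0.9) +
  coord_polar(theta = "y") +
  theme(axis.text.y = element_blank())

print(p)
```

With this data the outer ring should be a single "monday" slice, while the inner ring is split 75% "monday" and 25% "tuesday", matching the description in the question.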