
Compare Cluster and Overall distributions of categorical variables using pie charts

In the context of cluster profiling, I am trying to visualize the distribution of categorical variables in each cluster compared with the overall population.

In order to make them comparable, I use the Relative Frequency.

For numerical variables this is pretty straightforward, because I can simply overlay histograms.

For categorical variables, instead, I would like to obtain something like this:

[image: two concentric pie charts]

Here the outer pie chart shows the relative frequency within Cluster 1 and the inner pie chart shows the relative frequency in the overall population.

A reproducible example is:

mydf <- data.frame(week_day = as.factor(c(rep("monday",10), rep("monday",5), rep("tuesday",5))), cluster = c(rep(1,10), rep(2,10)))

Here, Cluster 1 is composed exclusively of "monday", whereas the overall population is 75% "monday" and 25% "tuesday".

The Relative Frequency within ggplot aes can be easily computed using:

y = (..count..)/sum(..count..)
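As a quick sanity check of the percentages quoted above, the same relative frequencies can be computed outside ggplot with base R's `table()` and `prop.table()`. This is only a sketch on the `mydf` example; the names `overall_freq` and `cluster1_freq` are introduced here for illustration.

```r
# Rebuild the example data from the question
mydf <- data.frame(
  week_day = as.factor(c(rep("monday", 10), rep("monday", 5), rep("tuesday", 5))),
  cluster  = c(rep(1, 10), rep(2, 10))
)

# Relative frequency of each week day in the overall population
overall_freq <- prop.table(table(mydf$week_day))

# Relative frequency of each week day within cluster 1 only
# (the subset keeps both factor levels, so "tuesday" appears with frequency 0)
cluster1_freq <- prop.table(table(mydf$week_day[mydf$cluster == 1]))

print(overall_freq)
print(cluster1_freq)
```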


  • Let's assume you are looking at a variable with four categories (A, B, C, D), and you have a data frame like this:

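    library(tidyverse) # provides tribble(), gather(), %>% and ggplot()

    # One row per category, then reshape to long format: one row per (Category, Cluster) pair
    d <- tribble(~Category, ~Overall, ~Cluster1,
             "A", 250, 20,
             "B", 250, 110,
             "C", 250, 30,
             "D", 250, 40) %>%
    gather(Overall, Cluster1, key = "Cluster", value = "Count")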

    which would mean: "over the whole dataset, 250 points have category A, 250 have category B, etc., and in Cluster1, 20 points have category A, 110 have category B, etc."

    ggplot treats a pie chart as a (scaled) stacked bar chart plotted in polar coordinates.

    To get a bar chart with relative frequencies, pass position = "fill" to geom_bar:

    ggplot(data = d) +
    geom_bar(stat = "identity",
             position = "fill", # automatically scales each bar from 0 to 1, necessary for polar coordinates
             aes(x = Cluster, y = Count, fill = Category))

    which gives you the following chart:

    [image: bar chart with relative frequencies]
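
    To make explicit what position = "fill" computes, here is a minimal base R sketch that derives the same within-bar proportions by hand. The data frame reproduces the long-format `d` from above without the tidyverse; the `RelFreq` column is a name introduced here for illustration.

    ```r
    # Long-format counts, equivalent to `d` built with tribble() and gather()
    d <- data.frame(
      Category = rep(c("A", "B", "C", "D"), times = 2),
      Cluster  = rep(c("Overall", "Cluster1"), each = 4),
      Count    = c(250, 250, 250, 250, 20, 110, 30, 40)
    )

    # Divide each count by the total of its own bar (Overall or Cluster1),
    # so the proportions within each bar sum to 1
    d$RelFreq <- d$Count / ave(d$Count, d$Cluster, FUN = sum)

    print(d)
    ```

    Each bar then sums to 1, which is what allows the two rings to be compared even though the cluster is much smaller than the overall population.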

    Then you need to switch to polar coordinates and use the y-axis as the angular coordinate. The x-axis (your clusters and the overall distribution) becomes the radial coordinate, so each bar turns into a ring.

    Pay attention to the order of the factor levels, so that the right ring (here: the overall distribution) ends up in the middle of the circles. My solution for the example is not meant to be optimal:

    d$Cluster <- factor(d$Cluster, levels = c("Overall","Cluster1"))
    # `Overall` is the first factor level, so it is drawn as the innermost ring

    And then, add the coord_polar layer:

    ggplot(data = d) +
    geom_bar(stat = "identity",
             position = "fill", # automatically scales each bar from 0 to 1, necessary for polar coordinates
             aes(x = Cluster, y = Count, fill = Category),
             width = .9) + # adjust the bar width to control the blank space between the rings; 1 = no blank space
    coord_polar(theta = "y") + # the y coordinate becomes the angular coordinate
    theme(axis.text.y = element_blank()) # I didn't look for a fancy way to display radial labels

    Which gives you:

    [image: pie chart with relative frequencies]
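
    Applied to the `mydf` example from the question, the same recipe might look like the sketch below. It assumes ggplot2 is installed; the names `overall`, `cluster1`, `counts_long` and `p` are introduced here for illustration, the long-format table is built in base R rather than with tribble(), and geom_col() is used as the shorthand for geom_bar(stat = "identity").

    ```r
    library(ggplot2)

    # Example data from the question
    mydf <- data.frame(
      week_day = as.factor(c(rep("monday", 10), rep("monday", 5), rep("tuesday", 5))),
      cluster  = c(rep(1, 10), rep(2, 10))
    )

    # Counts per week day: whole population vs. cluster 1
    overall  <- as.data.frame(table(week_day = mydf$week_day))
    cluster1 <- as.data.frame(table(week_day = mydf$week_day[mydf$cluster == 1]))
    overall$group  <- "Overall"
    cluster1$group <- "Cluster1"
    counts_long <- rbind(overall, cluster1)

    # "Overall" is the first level so that it is drawn as the inner ring
    counts_long$group <- factor(counts_long$group, levels = c("Overall", "Cluster1"))

    p <- ggplot(counts_long) +
      geom_col(aes(x = group, y = Freq, fill = week_day),
               position = "fill", # rescale each ring to relative frequencies
               width = 0.9) +
      coord_polar(theta = "y")    # y becomes the angle, group becomes the radius

    print(p)
    ```

    For other clusters, only the filter `mydf$cluster == 1` and the group label would need to change.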