Why does my graph (using ggplot) change when I use as.factor() in R?


I'm trying to use a bar graph to observe the proportion of employees who left, based on promotion.

Data:

HR <- structure(list(promo = c(0, 0, 0, 0, 1, 1), left = c(0, 0, 0, 1, 0, 1)), .Names = c("promo", "left"), row.names = c(NA, -6L), class = "data.frame")


Case 1: I used y = as.factor(left)

ggplot(data = HR, aes(x = as.factor(promo), y = as.factor(left), fill = factor(promo), colour = factor(promo))) +
  geom_bar(stat = "identity") +
  xlab('Promotion (True or False)') +
  ylab('The Employees that quit') +
  ggtitle('Comparison of Employees that resigned')

This produced the following graph: [image: Case 1]

Case 2: I used y = (left)

ggplot(data = HR, aes(x = as.factor(promo), y = left, fill = factor(promo), colour = factor(promo))) +
  geom_bar(stat = "identity") +
  xlab('Promotion (True or False)') +
  ylab('The Employees that quit') +
  ggtitle('Comparison of Employees that resigned')

This produced the following graph: [image: Case 2]

What causes this difference, and which graph should I draw inferences from?


Solution

  • I'm making a guess that your data looks something like this. In the future, it's very good to share your data reproducibly so it can be copy/pasted like this. (dput() is useful to make a copy/pasteable version of an R object definition.)

    df = data.frame(promo = c(rep(0, 4), rep(1, 2)),
                    left = c(0, 0, 0, 1, 0, 1))
    df
    #   promo left
    # 1     0    0
    # 2     0    0
    # 3     0    0
    # 4     0    1
    # 5     1    0
    # 6     1    1
    

    Your problem isn't the factorness of left. The real problem is that you specify stat = 'identity' in geom_bar(). stat = 'identity' is meant for pre-aggregated data, that is, for a data frame that already holds the exact values you want to show up in the plot. Your data has individual 1s and 0s, not the total counts of 1s and 0s, so stat = 'identity' is inappropriate.

    In fact, you should not specify a y aesthetic at all, because you do not have a column with y values. Your left column has individual values that need to be aggregated to get y values, and geom_bar() does that aggregation itself with its default stat = 'count'.
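
    To make the aggregation point concrete, here is a minimal base R sketch. It assumes the same six-row df guessed above (repeated so the snippet runs on its own); the names stacked_heights and counts are only illustrative. With a numeric y and stat = 'identity', the raw left values are stacked, so each bar ends up as tall as the sum of left in its promo group, whereas the default count stat tallies rows.

    ```r
    # Same six rows as the guessed df above, repeated so this runs standalone.
    df <- data.frame(promo = c(0, 0, 0, 0, 1, 1),
                     left  = c(0, 0, 0, 1, 0, 1))

    # What stat = "identity" stacks when y = left is numeric:
    # the raw values, so a bar's total height is sum(left) within its promo group.
    stacked_heights <- tapply(df$left, df$promo, sum)
    print(stacked_heights)

    # What geom_bar() computes by default (stat = "count"):
    # the number of rows for each promo / left combination.
    counts <- table(promo = df$promo, left = df$left)
    print(counts)
    ```

    For this sample, both promo groups sum to 1, so a Case 2 style plot shows two equally tall bars even though the groups differ in size (4 rows versus 2), which is why it is a poor basis for comparing proportions.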

    For counts, the graph is as simple as this:

    ggplot(df, aes(x = factor(promo), fill = factor(left))) +
      geom_bar()
    


    To show the proportion within each promo group instead of raw counts, switch to position = 'fill':

    ggplot(df, aes(x = factor(promo), fill = factor(left))) +
      geom_bar(position = 'fill')
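
    As a cross-check on what position = 'fill' displays, the within-group proportions can be computed by hand. This is a sketch under the same assumed df; props and p are illustrative names, and the percent axis labels are an optional extra that relies on the scales package (installed as a ggplot2 dependency), not something the plot above requires.

    ```r
    library(ggplot2)

    # Same six rows as the guessed df above, repeated so this runs standalone.
    df <- data.frame(promo = c(0, 0, 0, 0, 1, 1),
                     left  = c(0, 0, 0, 1, 0, 1))

    # Row-wise proportions: share of left = 0 and left = 1 within each promo group.
    props <- prop.table(table(promo = df$promo, left = df$left), margin = 1)
    print(props)

    # Same filled bar chart, with the y axis labelled as percentages.
    p <- ggplot(df, aes(x = factor(promo), fill = factor(left))) +
      geom_bar(position = "fill") +
      scale_y_continuous(labels = scales::percent) +
      ylab("Share of employees")
    print(p)
    ```

    The left = 1 segment of each bar should match the corresponding entry in props, which is the quantity the question is actually after.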
    



    If I'm incorrect in my assumption of how your data look, please provide some sample data in your question. Data is best shared either with code to create it (as above) or via dput().
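
    For reference, a small sketch of the dput() route, again using the assumed df (code and df_copy are illustrative names). dput() prints R code that recreates the object, and that printed code is what you would paste into a question; the round trip below only checks that the printed code rebuilds an equivalent data frame.

    ```r
    # Same six rows as the guessed df above, repeated so this runs standalone.
    df <- data.frame(promo = c(0, 0, 0, 0, 1, 1),
                     left  = c(0, 0, 0, 1, 0, 1))

    # Print copy/pasteable code that recreates df.
    dput(df)

    # Round-trip check: capture that code as text, evaluate it, and compare.
    code <- capture.output(dput(df))
    df_copy <- eval(parse(text = code))
    print(all.equal(df, df_copy))
    ```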