I'm trying to use a bar graph to observe the proportion of employees who left, based on promotion.
Data: structure(list(promo = c(0, 0, 0, 0, 1, 1), left = c(0, 0, 0, 1, 0, 1)), .Names = c("promo", "left"), row.names = c(NA, -6L ), class = "data.frame")
Case 1: I used y = as.factor(left)
ggplot(data = HR, aes(x = as.factor(promotion), y =as.factor(left), fill = factor(promotion), colour=factor(promotion))) +
geom_bar(stat="identity")+
xlab('Promotion (True or False)')+
ylab('The Employees that quit')+
ggtitle('Comparision of Employees that resigned')
This produced the following graph: [image: Case 1]
Case 2: I used y = (left)
ggplot(data = HR, aes(x = as.factor(promotion), y = (left), fill = factor(promotion), colour=factor(promotion))) +
geom_bar(stat="identity")+
xlab('Promotion (True or False)')+
ylab('The Employees that quit')+
ggtitle('Comparision of Employees that resigned')
This produced the following graph: [image: Case 2]
What causes this difference, and which graph should I draw inferences from?
I'm guessing that your data looks something like this. In the future, it's very good to share your data reproducibly so it can be copied and pasted like this. (dput() is useful for making a copy/pasteable version of an R object definition.)
df = data.frame(promo = c(rep(0, 4), rep(1, 2)),
left = c(0, 0, 0, 1, 0, 1))
df
# promo left
# 1 0 0
# 2 0 0
# 3 0 0
# 4 0 1
# 5 1 0
# 6 1 1
Your problem isn't the factor-ness of left. Your problem is that you specify stat = 'identity' in geom_bar(). stat = 'identity' is used when the data is pre-aggregated, that is, when your data frame already has the exact values you want to show up in the plot. In this case, your data has individual 1s and 0s, not the total number of 1s and 0s in each group, so stat = 'identity' is inappropriate.
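For contrast, here is a minimal sketch of data where stat = 'identity' would be appropriate, assuming the toy data frame from above. The names agg and n are made up for this illustration; the point is only that each row already holds the bar height.

```r
library(ggplot2)

# Same toy data as above (an assumption about what the real data looks like)
df <- data.frame(promo = c(0, 0, 0, 0, 1, 1),
                 left  = c(0, 0, 0, 1, 0, 1))

# Pre-aggregate: one row per promo/left combination, with its row count in n
# (agg and n are illustrative names, not part of the original data)
agg <- as.data.frame(table(promo = df$promo, left = df$left),
                     responseName = "n")
agg

# Now n really is the height to draw, so stat = 'identity' fits
p <- ggplot(agg, aes(x = promo, y = n, fill = left)) +
  geom_bar(stat = "identity")
p
```

Here as.data.frame() on a table returns promo and left as factors, so no factor() call is needed in aes().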
In fact, you should not specify a y aesthetic at all, because you do not have a column with y values. Your left column has individual values that need to be aggregated to get y values, and geom_bar handles that aggregation when stat is not 'identity'.
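To make the aggregation concrete, the sketch below computes it by hand in base R, again assuming the toy data frame. The first result is the per-group row count that a bar chart of counts needs. The second shows what a numeric y with stat = 'identity' and stacked bars adds up to instead: the raw left values summed per promo group.

```r
# Same toy data as above (an assumption about what the real data looks like)
df <- data.frame(promo = c(0, 0, 0, 0, 1, 1),
                 left  = c(0, 0, 0, 1, 0, 1))

# Number of rows for each promo/left combination
counts <- table(promo = df$promo, left = df$left)
counts
#      left
# promo 0 1
#     0 3 1
#     1 1 1

# Sum of the raw left values per promo group, i.e. the number of leavers
leavers <- tapply(df$left, df$promo, sum)
leavers
# 0 1
# 1 1
```

With this data, both promo groups have one leaver, so a chart of summed values shows equal bars even though the groups differ in size.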
For counts, the graph is as simple as this:
ggplot(df, aes(x = factor(promo), fill = factor(left))) +
geom_bar()
And to show each bar as a proportion of the total within each promo group, we can switch to position = 'fill':
ggplot(df, aes(x = factor(promo), fill = factor(left))) +
geom_bar(position = 'fill')
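As a sanity check on the filled chart, the same within-group proportions can be computed directly in base R. This is a sketch on the assumed toy data, not on your real HR data.

```r
# Same toy data as above (an assumption about what the real data looks like)
df <- data.frame(promo = c(0, 0, 0, 0, 1, 1),
                 left  = c(0, 0, 0, 1, 0, 1))

counts <- table(promo = df$promo, left = df$left)

# margin = 1 normalises each row, so every promo group sums to 1
props <- prop.table(counts, margin = 1)
props
#      left
# promo    0    1
#     0 0.75 0.25
#     1 0.50 0.50
```

In this toy data, 25% of non-promoted employees left versus 50% of promoted ones, which is what the segment heights in the filled bar chart should match.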
If I'm incorrect in my assumption of how your data looks, please provide some sample data in your question. Data is best shared either with code to create it (as above) or via dput().