I have a simple ggplot bar plot that displays information about school expenses. It retrieves its information from a data frame with the following columns: DATE, MONTANT, LIEU and CAUSE.
You can take a closer look at this data at the end of this post (csv format).
Each bar in my plot represents a different purchase location. Each bar stacks one coloured segment per purchase, with a height proportional to its amount. Here is a look at my plot:
As you can see, the scaling is clearly off (the 10.28 tick is about a third as high as the 215.25 tick on the y axis).
How should I go about making the scaling accurate and what is causing this inaccurate y axis?
Here is my raw csv file:
"DATE" ;"MONTANT";"LIEU" ;"CAUSE"
"2020-01-25"; 67.17;"Coop Cégep" ;"Notes de cours"
"2020-02-24"; 7.67;"Coop Cégep" ;"Notes de cours"
"2020-01-30"; 10.28;"Coop Cégep" ;"Cahiers d'exercices"
"2020-03-02"; 215.25;"Omnivox (Cégep Lanaudière)";"Frais de scholarité"
"2020-01-22"; 114.60;"Coop Cégep" ;"Romans, Notes de cours"
"2020-08-27"; 78.33;"Coop Cégep" ;"Romans, Notes de cours"
"<++>" ; <++>;"<++>" ;"<++>"
Here is the code I used to generate this image:
#!/bin/Rscript
# LIBRARIES ----
library(ggplot2)
library(RColorBrewer)
# CSV's ----
expenses <- head(data.frame(read.csv("paiements.csv", header=TRUE, sep=";")), -1)
expenses$DATE <- as.Date(expenses$DATE)
# PLOTS ----
# Bar plot with different expenses sorted by location
expenses_df <- ggplot(expenses, aes(LIEU, MONTANT, fill=MONTANT)) +
  geom_bar(stat="identity") +
  geom_jitter(width=0.1, height=0, shape=18, size=4) +
  labs(
    title="Montants de diverses dépenses scholaires",
    x="Lieu",
    y="Montant") +
  theme(plot.title = element_text(hjust=0.5))
# JPEG ----
jpeg(
file="paiements.jpg",
)
print(expenses_df)
dev.off()
Data in dput format:
expenses <-
structure(list(DATE = c("2020-01-25", "2020-02-24", "2020-01-30",
"2020-03-02", "2020-01-22", "2020-08-27"), MONTANT = c(67.17,
7.67, 10.28, 215.25, 114.6, 78.33), LIEU = c("Coop Cégep", "Coop Cégep",
"Coop Cégep", "Omnivox (Cégep Lanaudière)", "Coop Cégep",
"Coop Cégep"), CAUSE = c("Notes de cours", "Notes de cours",
"Cahiers d'exercices", "Frais de scholarité", "Romans, Notes de cours",
"Romans, Notes de cours")), row.names = c(NA, -6L), class = "data.frame")
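The problem seems to be the last line of the file. The placeholder string "<++>" that ends each column forces read.csv to read the MONTANT column as text instead of numbers, so ggplot treats each amount as a discrete category and the y axis is no longer to scale. Dropping the last row with head(..., -1) removes the placeholder but does not change the column's type. Here is a way of solving it.
1. Coerce MONTANT to numeric;
2. The "<++>" value becomes NA, with a warning "NAs introduced by coercion";
3. Keep only the rows where !is.na(.) is TRUE.
The code will be the following.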
expenses$MONTANT <- as.numeric(expenses$MONTANT)
expenses <- expenses[!is.na(expenses$MONTANT), ]
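To see the cause in isolation, here is a minimal sketch that rebuilds a three-row version of the file from an inline string. The names csv_text, raw, montant and clean are made up for illustration. stringsAsFactors = FALSE is passed explicitly (it is the default since R 4.0); in older versions the column would be read as a factor, and as.numeric() on a factor returns the level codes, so as.numeric(as.character(x)) would be needed there.

```r
# Two real rows from the question plus the "<++>" placeholder row.
csv_text <- '"DATE";"MONTANT";"LIEU";"CAUSE"
"2020-01-25"; 67.17;"Coop Cégep";"Notes de cours"
"2020-03-02"; 215.25;"Omnivox (Cégep Lanaudière)";"Frais de scholarité"
"<++>"; <++>;"<++>";"<++>"'

raw <- read.csv(text = csv_text, header = TRUE, sep = ";",
                stringsAsFactors = FALSE)

# The placeholder prevents numeric parsing, so the whole column is text.
print(class(raw$MONTANT))

# Coerce to numeric; the placeholder becomes NA.
# The coercion warning is silenced here on purpose.
montant <- suppressWarnings(as.numeric(raw$MONTANT))

# Keep only the rows that parsed, and store the numeric values.
clean <- raw[!is.na(montant), ]
clean$MONTANT <- montant[!is.na(montant)]
print(clean$MONTANT)
```

Checking class(expenses$MONTANT) right after reading the real file is a quick way to confirm whether the same thing is happening there.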
Now coerce the date column to class "Date" and plot. I have mapped the fill to CAUSE, so it defines the colour of each bar segment.
expenses$DATE <- as.Date(expenses$DATE)
library(ggplot2)
ggplot(expenses, aes(LIEU, MONTANT, fill = CAUSE)) +
  geom_bar(stat="identity") +
  geom_jitter(width=0.1, height=0, shape=18, size=4) +
  labs(
    title="Montants de diverses dépenses scholaires",
    x="Lieu",
    y="Montant") +
  theme(plot.title = element_text(hjust=0.5))
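As a possible alternative, the placeholder can be declared as a missing value when the file is read, so that MONTANT should come out numeric without a separate coercion step. This is only a sketch on an inline copy of a few rows (csv_text is a made-up name); it assumes that strip.white = TRUE is needed so the padded, unquoted " <++>" field matches the na.strings entry.

```r
# Two real rows from the question plus the "<++>" placeholder row.
csv_text <- '"DATE";"MONTANT";"LIEU";"CAUSE"
"2020-01-25"; 67.17;"Coop Cégep";"Notes de cours"
"2020-03-02"; 215.25;"Omnivox (Cégep Lanaudière)";"Frais de scholarité"
"<++>"; <++>;"<++>";"<++>"'

# Treat "<++>" as NA while reading; strip.white removes the padding
# around unquoted fields before the comparison with na.strings.
expenses <- read.csv(text = csv_text, header = TRUE, sep = ";",
                     na.strings = "<++>", strip.white = TRUE,
                     stringsAsFactors = FALSE)

# Drop the placeholder row, identified by its missing amount.
expenses <- expenses[!is.na(expenses$MONTANT), ]
expenses$DATE <- as.Date(expenses$DATE)

str(expenses)
```

With the real file, text = csv_text would be replaced by the file name "paiements.csv"; the plotting code stays the same.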