r ggplot2 plot bar-chart geom-bar

How do I get reliable scale ticks with bar plots that sum up numbers into single bars in ggplot (R)?


I have a simple ggplot bar plot that displays information about school expenses. It retrieves its information from a data frame with the following columns:

  • Where the purchase was made (there are two recurrent locations)
  • What was the purchase amount in dollars

You can take a closer look at this data at the end of this post (csv format).

Each bar in my plot represents a different purchase location. Each bar stacks one coloured segment per purchase, with a height proportional to the purchase amount. Here is a look at my plot:

Example plot

As you can see, the scaling is clearly off: on the y axis, the 10.28 tick is about a third as high as the 215.25 tick.

How should I go about making the scaling accurate and what is causing this inaccurate y axis?

Here is my raw csv file:

"DATE"      ;"MONTANT";"LIEU"                      ;"CAUSE"
"2020-01-25";    67.17;"Coop Cégep"                ;"Notes de cours"
"2020-02-24";     7.67;"Coop Cégep"                ;"Notes de cours"
"2020-01-30";    10.28;"Coop Cégep"                ;"Cahiers d'exercices"
"2020-03-02";   215.25;"Omnivox (Cégep Lanaudière)";"Frais de scholarité"
"2020-01-22";   114.60;"Coop Cégep"                ;"Romans, Notes de cours"
"2020-08-27";    78.33;"Coop Cégep"                ;"Romans, Notes de cours"
"<++>"      ;     <++>;"<++>"                      ;"<++>"

Here is the code I used to generate this image:

#!/bin/Rscript

# LIBRARIES ----

library(ggplot2)
library(RColorBrewer)

# CSV's ----

expenses <- head(data.frame(read.csv("paiements.csv", header=TRUE, sep=";")), -1)
expenses$DATE  <- as.Date(expenses$DATE)

# PLOTS ----

# Bar plot with different expenses sorted by location
expenses_df <- ggplot(expenses, aes(LIEU, MONTANT, fill=MONTANT)) +
    geom_bar(stat="identity") +
    geom_jitter(width=0.1, height=0, shape=18, size=4) +
    labs(
             title="Montants de diverses dépenses scholaires",
             x="Lieu",
             y="Montant") +
    theme(plot.title = element_text(hjust=0.5))

# JPEG ----

jpeg(filename="paiements.jpg")

print(expenses_df)

dev.off()

Data in dput format

expenses <-
structure(list(DATE = c("2020-01-25", "2020-02-24", "2020-01-30", 
"2020-03-02", "2020-01-22", "2020-08-27"), MONTANT = c(67.17, 
7.67, 10.28, 215.25, 114.6, 78.33), LIEU = c("Coop Cégep", "Coop Cégep", 
"Coop Cégep", "Omnivox (Cégep Lanaudière)", "Coop Cégep", 
"Coop Cégep"), CAUSE = c("Notes de cours", "Notes de cours", 
"Cahiers d'exercices", "Frais de scholarité", "Romans, Notes de cours", 
"Romans, Notes de cours")), row.names = c(NA, -6L), class = "data.frame")

Solution

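  • The problem seems to be the last line of the file. The placeholder string "<++>" that ends each column prevents read.csv from parsing MONTANT as numbers, so the column is read as text and ggplot treats the y axis as discrete, spacing the ticks evenly regardless of their values. Dropping that row afterwards with head(., -1) does not change the column's type. Here is a way of solving it.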

    1. Coerce the column MONTANT to numeric;
    2. Vector elements that cannot be numeric become NA, with a warning "NAs introduced by coercion";
    3. Remove those rows with !is.na(.).

    The code will be the following.

    expenses$MONTANT <- as.numeric(expenses$MONTANT)
    expenses <- expenses[!is.na(expenses$MONTANT), ]
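
To see the type problem and the fix in isolation, here is a minimal sketch that reads an inline copy of the question's CSV instead of paiements.csv. The names csv_text, raw, cleaned and montant_class are made up for this illustration, the column padding has been removed from the data, and the expected coercion warning is silenced with suppressWarnings().

```r
# Inline copy of the question's CSV (column padding removed),
# including the trailing "<++>" placeholder row.
csv_text <- paste(
  '"DATE";"MONTANT";"LIEU";"CAUSE"',
  '"2020-01-25";67.17;"Coop Cégep";"Notes de cours"',
  '"2020-02-24";7.67;"Coop Cégep";"Notes de cours"',
  '"2020-01-30";10.28;"Coop Cégep";"Cahiers d\'exercices"',
  '"2020-03-02";215.25;"Omnivox (Cégep Lanaudière)";"Frais de scholarité"',
  '"2020-01-22";114.60;"Coop Cégep";"Romans, Notes de cours"',
  '"2020-08-27";78.33;"Coop Cégep";"Romans, Notes de cours"',
  '"<++>";<++>;"<++>";"<++>"',
  sep = "\n"
)

raw <- read.csv(text = csv_text, header = TRUE, sep = ";",
                stringsAsFactors = FALSE)

# The placeholder row forces the whole column to be read as text.
montant_class <- class(raw$MONTANT)
print(montant_class)

# Coerce to numeric: "<++>" becomes NA. The expected
# "NAs introduced by coercion" warning is silenced here.
cleaned <- raw
cleaned$MONTANT <- suppressWarnings(as.numeric(cleaned$MONTANT))

# Keep only the rows whose amount could be parsed.
cleaned <- cleaned[!is.na(cleaned$MONTANT), ]
str(cleaned$MONTANT)
```

Checking class() or str() on a column right after read.csv is a quick way to catch this kind of problem before plotting.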
    

    Now coerce the date column to class "Date" and plot. I have mapped fill to CAUSE, so each segment's colour shows what the purchase was for.

    expenses$DATE  <- as.Date(expenses$DATE)
    
    library(ggplot2)
    
    ggplot(expenses, aes(LIEU, MONTANT, fill = CAUSE)) +
      geom_bar(stat="identity") +
      geom_jitter(width=0.1, height=0, shape=18, size=4) +
      labs(
        title="Montants de diverses dépenses scholaires",
        x="Lieu",
        y="Montant") +
      theme(plot.title = element_text(hjust=0.5))
    

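
As a sanity check on the corrected plot: with stat="identity" and the default stacking, each bar's top should sit at the sum of MONTANT for that LIEU. The sketch below computes those totals in base R from the dput data, so they can be compared against the y axis. The name totals is only for this illustration, and only the two columns needed are reproduced.

```r
# The two relevant columns from the dput output in the question.
expenses <- data.frame(
  MONTANT = c(67.17, 7.67, 10.28, 215.25, 114.6, 78.33),
  LIEU = c("Coop Cégep", "Coop Cégep", "Coop Cégep",
           "Omnivox (Cégep Lanaudière)", "Coop Cégep", "Coop Cégep"),
  stringsAsFactors = FALSE
)

# Total amount per location: the height each stacked bar should reach.
totals <- aggregate(MONTANT ~ LIEU, data = expenses, FUN = sum)
print(totals)
```

If the tallest bar in the plot does not line up with the largest total on a continuous y axis, the MONTANT column is probably still not numeric.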