
ggplot & summary: best way to draw summary statistics?


I'm trying to find a way to plot some summary statistics, but I can't work out which kind of plot is the right one.

I have two datasets about the same cities, and I need a graphical representation of four variables (two from the first dataset, two from the second).

These are my summaries:

> summary(data_1$New_wage)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  777.7  1480.0  1633.1  1634.6  1774.3  2408.1 

> summary(data_1$Old_wage)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  471.3   658.9   693.1   696.9   735.2  1001.9 

> summary(data_2$New_wage)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1895    2072    2154    2166    2259    2543 

> summary(data_2$Old_wage)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1777    2109    2236    2244    2352    2833 

I need to draw them in a single plot, to show where each distribution starts and whether the medians (or the means) are close to one another.

What is the best way to show this kind of info?

I thought of a density plot (is that the right choice?):

ggplot()+
  geom_density(data=data_1, aes(x= `New_wage`),alpha=.8, fill="pink")+
  geom_density(data=data_1, aes(x= `Old_wage`),alpha=.8, fill="yellow")+
  geom_density(data=data_2, aes(x= `New_wage`),alpha=.8, fill="lightblue")+
  geom_density(data=data_2, aes(x= `Old_wage`),alpha=.8, fill="orange")+
  geom_vline(xintercept = 1633.1, size=0.5)+
  geom_vline(xintercept = 693.1, size=0.5)+
  geom_vline(xintercept = 2154, size=0.5)+
  geom_vline(xintercept = 2236, size=0.5)+
  theme_classic()

But what I get is pretty ugly:

ugly plot

Do you know a way to improve it, or another kind of plot that would show this information better?

Here is a sample from each dataset:

> data_1
      City           New_wage     Old_wage
Torino                1962.18     770.51
Alessandria           1742.85     676.4 
Asti                  1541.81     609.46
Biella                1612.2      741.55
Cuneo                 1574        637.71
Novara                1823.53     715.83
Verbano -Cusio-Ossola 1584.49     640.15
Vercelli              1666.21     735.68
Aosta                 1747.81     695.71
Genova                2066.42     738.37
Imperia               1498.01     646.5 
La Spezia             1871.34     693.83
Savona                1770.41     676.71
Milano                2240.03     851.42
Bergamo               1729.17     586.84
> data_2
      City           New_wage    Old_wage 
Torino                2122.48    2335.66
Alessandria           2081.89    2268.23
Asti                  2034.57    2238.94
Biella                1941.49    2394.96
Cuneo                 1998.25    2288.37
Novara                2121.21    2468.1 
Verbano -Cusio-Ossola 2025.62    2146.13
Vercelli              2031.75    2385.21
Aosta                 2099.45    2264.07
Genova                2160.59    2378.96
Imperia               2056.4     2171.72
La Spezia             2218.75    2761.92
Savona                2002.54    2215.3 
Milano                2027.45    2358.53
Bergamo               1905.06    2340.58

Solution

  • First, bind your data frames together, then pivot into long format.

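    library(tidyverse)
    
    # Stack the two data frames; .id = "data" records which one each row came from
    df <- bind_rows(data_1, data_2, .id = "data") %>%
      mutate(Data = c("Data 1", "Data 2")[as.numeric(data)]) %>%
      # One row per city and wage variable; the numbers land in the default "value" column
      pivot_longer(New_wage:Old_wage, names_to = "Wage") %>%
      # Fix the factor order so Old_wage is drawn before New_wage
      mutate(Wage = factor(Wage, c("Old_wage", "New_wage")))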
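    
    To check the reshaping on a small scale, here is a minimal sketch that rebuilds `data_1` and `data_2` from the first three sample rows in the question, runs the same pipeline, and then computes the quartiles, median and mean per group. The `stats` table and its column names are illustrative assumptions, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# First three sample rows of each data set, copied from the question
data_1 <- data.frame(
  City     = c("Torino", "Alessandria", "Asti"),
  New_wage = c(1962.18, 1742.85, 1541.81),
  Old_wage = c(770.51, 676.40, 609.46)
)
data_2 <- data.frame(
  City     = c("Torino", "Alessandria", "Asti"),
  New_wage = c(2122.48, 2081.89, 2034.57),
  Old_wage = c(2335.66, 2268.23, 2238.94)
)

# Same bind + pivot as above
df <- bind_rows(data_1, data_2, .id = "data") %>%
  mutate(Data = c("Data 1", "Data 2")[as.numeric(data)]) %>%
  pivot_longer(New_wage:Old_wage, names_to = "Wage") %>%
  mutate(Wage = factor(Wage, c("Old_wage", "New_wage")))

# Long format: one row per city, data set and wage variable
head(df, 4)

# Summary statistics for each of the four distributions
stats <- df %>%
  group_by(Data, Wage) %>%
  summarise(
    q1     = quantile(value, 0.25, names = FALSE),
    median = median(value),
    mean   = mean(value),
    q3     = quantile(value, 0.75, names = FALSE),
    .groups = "drop"
  )
stats
```

    With the data in this shape, a single `group_by()` covers all four distributions, and the same grouping columns can be mapped to plot aesthetics.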
    

    You can then choose how you want to represent the data.

    For example, you can get a fairly sophisticated result using ggstatsplot:

    library(ggstatsplot)
    
    grouped_ggwithinstats(df, Wage, value, grouping.var = Data)
    


    Or if you prefer vanilla ggplot, you could do something like:

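    # Violins show each distribution and jittered points show the individual cities.
    # The error bar has ymin = ymax, so it draws a horizontal line at each group's
    # mean (stat = "summary" falls back to mean_se() when no function is given).
    ggplot(df, aes(Data, value, fill = Wage)) +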
      geom_violin(alpha = 0.5) +
      geom_point(position = position_jitterdodge(dodge.width = 0.9, 
                                                 jitter.width = 0.2),
                 alpha = 0.3) +
      geom_errorbar(stat = "summary", position = "dodge", key_glyph = draw_key_path,
                    aes(ymax = after_stat(y), ymin = after_stat(y), 
                        linetype = "mean")) +
      scale_fill_manual(values = c("deepskyblue4", "orange")) +
      labs(linetype = NULL) +
      theme_minimal(base_size = 16)
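    
    Since the question is specifically about quartiles and medians, a box plot may also be worth trying, because it draws the median and the first and third quartiles directly. The following is only a sketch under stated assumptions: it rebuilds the long-format `df` from the first five sample rows of the question so that it runs on its own, and the diamond marking each mean is an illustrative choice rather than part of the original answer.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# First five sample rows of each data set, copied from the question
cities <- c("Torino", "Alessandria", "Asti", "Biella", "Cuneo")
data_1 <- data.frame(
  City     = cities,
  New_wage = c(1962.18, 1742.85, 1541.81, 1612.20, 1574.00),
  Old_wage = c(770.51, 676.40, 609.46, 741.55, 637.71)
)
data_2 <- data.frame(
  City     = cities,
  New_wage = c(2122.48, 2081.89, 2034.57, 1941.49, 1998.25),
  Old_wage = c(2335.66, 2268.23, 2238.94, 2394.96, 2288.37)
)

# Same bind + pivot as in the first step
df <- bind_rows(data_1, data_2, .id = "data") %>%
  mutate(Data = c("Data 1", "Data 2")[as.numeric(data)]) %>%
  pivot_longer(New_wage:Old_wage, names_to = "Wage") %>%
  mutate(Wage = factor(Wage, c("Old_wage", "New_wage")))

p <- ggplot(df, aes(Data, value, fill = Wage)) +
  # Box = first to third quartile, centre line = median
  geom_boxplot(alpha = 0.5) +
  # Diamond = mean; dodge width 0.75 matches geom_boxplot's default dodging
  stat_summary(fun = mean, geom = "point", shape = 23, size = 3,
               position = position_dodge(width = 0.75)) +
  scale_fill_manual(values = c("deepskyblue4", "orange")) +
  theme_minimal(base_size = 16)

p
```

    If the two data sets sit on very different scales, adding `facet_wrap(~Data, scales = "free")` and mapping `Wage` to the x axis instead may make the individual boxes easier to read.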
    
