I'm trying to find a way to plot some summary statistics, but I can't work out which kind of plot is the right one.
I have 2 datasets about the same cities, and I need to give a graphical representation of four variables (2 from the first dataset, 2 from the second).
These are my summaries:
> summary(data_1$New_wage)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  777.7  1480.0  1633.1  1634.6  1774.3  2408.1
> summary(data_1$Old_wage)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  471.3   658.9   693.1   696.9   735.2  1001.9
> summary(data_2$New_wage)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1895    2072    2154    2166    2259    2543
> summary(data_2$Old_wage)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1777    2109    2236    2244    2352    2833
I need to draw them in a single plot, to show at which quantile each distribution starts and whether the medians (or the means) are close to each other.
What is the best way to show this kind of information?
I thought of using a density plot (is that the right choice?):
ggplot() +
  geom_density(data = data_1, aes(x = `New_wage`), alpha = .8, fill = "pink") +
  geom_density(data = data_1, aes(x = `Old_wage`), alpha = .8, fill = "yellow") +
  geom_density(data = data_2, aes(x = `New_wage`), alpha = .8, fill = "lightblue") +
  geom_density(data = data_2, aes(x = `Old_wage`), alpha = .8, fill = "orange") +
  geom_vline(xintercept = 1633.1, size = 0.5) +
  geom_vline(xintercept = 693.1, size = 0.5) +
  geom_vline(xintercept = 2154, size = 0.5) +
  geom_vline(xintercept = 2236, size = 0.5) +
  theme_classic()
But what I get is pretty ugly.
Do you know a way to make it better, or any other kind of plot that would show this information more clearly?
Here is a sample from each dataset:
> data_1
City New_wage Old_wage
Torino 1962.18 770.51
Alessandria 1742.85 676.4
Asti 1541.81 609.46
Biella 1612.2 741.55
Cuneo 1574 637.71
Novara 1823.53 715.83
Verbano -Cusio-Ossola 1584.49 640.15
Vercelli 1666.21 735.68
Aosta 1747.81 695.71
Genova 2066.42 738.37
Imperia 1498.01 646.5
La Spezia 1871.34 693.83
Savona 1770.41 676.71
Milano 2240.03 851.42
Bergamo 1729.17 586.84
> data_2
City New_wage Old_wage
Torino 2122.48 2335.66
Alessandria 2081.89 2268.23
Asti 2034.57 2238.94
Biella 1941.49 2394.96
Cuneo 1998.25 2288.37
Novara 2121.21 2468.1
Verbano -Cusio-Ossola 2025.62 2146.13
Vercelli 2031.75 2385.21
Aosta 2099.45 2264.07
Genova 2160.59 2378.96
Imperia 2056.4 2171.72
La Spezia 2218.75 2761.92
Savona 2002.54 2215.3
Milano 2027.45 2358.53
Bergamo 1905.06 2340.58
First, bind your data frames together, then pivot into long format.
library(tidyverse)
df <- bind_rows(data_1, data_2, .id = "data") %>%
  mutate(Data = c("Data 1", "Data 2")[as.numeric(data)]) %>%
  pivot_longer(New_wage:Old_wage, names_to = "Wage") %>%
  mutate(Wage = factor(Wage, c("Old_wage", "New_wage")))
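To check what the reshaping step produces, here is a minimal sketch that runs the same pipeline on three rows per data set copied from the question's sample. The names df_small and wage_summary are made up for this illustration, and the per-group summary is only a sanity check against the summary() output, not part of the plotting code.

```r
library(dplyr)
library(tidyr)

# Three cities per data set, copied from the question's sample
data_1 <- data.frame(
  City     = c("Torino", "Alessandria", "Asti"),
  New_wage = c(1962.18, 1742.85, 1541.81),
  Old_wage = c(770.51, 676.40, 609.46)
)
data_2 <- data.frame(
  City     = c("Torino", "Alessandria", "Asti"),
  New_wage = c(2122.48, 2081.89, 2034.57),
  Old_wage = c(2335.66, 2268.23, 2238.94)
)

# Same pipeline as above: stack the data sets, label them, go long
df_small <- bind_rows(data_1, data_2, .id = "data") %>%
  mutate(Data = c("Data 1", "Data 2")[as.numeric(data)]) %>%
  pivot_longer(New_wage:Old_wage, names_to = "Wage") %>%
  mutate(Wage = factor(Wage, c("Old_wage", "New_wage")))

# One row per city, data set and wage type: 3 x 2 x 2 = 12 rows
nrow(df_small)

# Median and mean per data set and wage type
wage_summary <- df_small %>%
  group_by(Data, Wage) %>%
  summarise(median = median(value), mean = mean(value), .groups = "drop")
wage_summary
```

With the data in this shape, Data and Wage can be mapped directly to aesthetics, so a single geom layer covers all four distributions.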
You can then choose how you want to represent the data.
For example, you can get a fairly sophisticated result using ggstatsplot:
library(ggstatsplot)
grouped_ggwithinstats(df, Wage, value, grouping.var = Data)
Or if you prefer vanilla ggplot, you could do something like:
ggplot(df, aes(Data, value, fill = Wage)) +
  geom_violin(alpha = 0.5) +
  geom_point(position = position_jitterdodge(dodge.width = 0.9,
                                             jitter.width = 0.2),
             alpha = 0.3) +
  geom_errorbar(stat = "summary", position = "dodge", key_glyph = draw_key_path,
                aes(ymax = after_stat(y), ymin = after_stat(y),
                    linetype = "mean")) +
  scale_fill_manual(values = c("deepskyblue4", "orange")) +
  labs(linetype = NULL) +
  theme_minimal(base_size = 16)
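Since the question asks specifically about quartiles and medians, a box plot may also be worth trying, because it draws the median and the first and third quartiles directly. The following is only a sketch of that alternative, not part of the approach above: df_demo is a made-up name for a small long-format data frame built from five cities in the question's sample, and the diamond marking the mean is an illustrative choice.

```r
library(ggplot2)

# Five cities per data set, copied from the question's sample,
# already in long format (one row per city, data set and wage type)
cities <- c("Torino", "Alessandria", "Asti", "Biella", "Cuneo")
df_demo <- data.frame(
  City  = rep(cities, times = 4),
  Data  = rep(c("Data 1", "Data 2"), each = 10),
  Wage  = factor(rep(rep(c("Old_wage", "New_wage"), each = 5), times = 2),
                 levels = c("Old_wage", "New_wage")),
  value = c(770.51, 676.40, 609.46, 741.55, 637.71,       # data_1 Old_wage
            1962.18, 1742.85, 1541.81, 1612.20, 1574.00,  # data_1 New_wage
            2335.66, 2268.23, 2238.94, 2394.96, 2288.37,  # data_2 Old_wage
            2122.48, 2081.89, 2034.57, 1941.49, 1998.25)  # data_2 New_wage
)

# Boxes show median and quartiles; the diamond marks the mean
p <- ggplot(df_demo, aes(Data, value, fill = Wage)) +
  geom_boxplot(alpha = 0.5, position = position_dodge(width = 0.9)) +
  stat_summary(fun = mean, geom = "point", shape = 18, size = 3,
               position = position_dodge(width = 0.9)) +
  scale_fill_manual(values = c("deepskyblue4", "orange")) +
  theme_minimal(base_size = 16)
p
```

Using the same dodge width for both layers keeps each mean marker centred on its box.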