Search code examples
rhistogramkernel-densitydensity-plot

Is it possible to create this graph on R?


I'm really new to R and I'm looking to create a graph similar to the one attached. I have tried to create a density plot using both ggplot and the base program.

I have used code ggplot(data, aes(x = Freq)) + geom_density() but the output is incorrect. I'm getting a spike at each number point rather than an overall curve. Every row is one data point of between 1 to 7 and the frequency distributions for one trait is as follows:

1: 500, 2: 550 3:700 4:1000 5:900 6:835: 7:550

As such I have 5035 rows as one row equates to one score.

Any help is much appreciated.

Here is what I wish the plot would look like. (Note I'll add other traits at a later stage, I just wish to add one line at the moment).

enter image description here


Solution

  • there are a few things going on here, first is generating summary statistics of the data. you just need to call mean and sd in the appropriate way to get mean and standard deviation from your data. you've not shown your data so it would be difficult to suggest much here.

    as far as plotting these summary statistics, you can replicate the plot from the original paper easily, but it's pretty bad and I'd suggest you not do that. stronger lines imply more importance, the need to double label everything, mislabelling the y-axis, all of that on top of drawing nice smooth parametric curves gives a false impression of confidence. I've only scanned the paper, but that sort of data is crying out for a multi-level model of some sort

    I prefer "base" graphics, ggplot is great for exploratory graphics but if you have hard constraints on what a plot should look like it tends to get in the way. We start with the summary statistics:

    df <- read.csv(text="
    title,              mu, sigma,label, label_x,label_pos
    Extraversion,       4.0, 1.08,Extra,    3.85,3
    Agreeableness,      5.0, 0.77,Agree,    5.0, 3
    Conscientiousness,  4.7, 0.97,Cons,     3.4, 2
    Emotional stability,5.3, 0.84,Emot stab,5.9, 4
    Intellect,          3.7, 0.86,Intellect,3.7, 3
    ")
    

    I've just pulled numbers out of the paper here, you'd have to calcular them. the mu column is the mean of the variable, and sigma is the standard deviation. label_x and label_pos are used to draw labels so need to be manually chosen (or the plot can be annotated afterwards in something like Inkscape). label_x is the x-axis position, and label_pos stands for where it is in relation to the x-y point (see text for info about the pos parameter)

    next we calculate a couple of things:

    lwds <- 1 + seq(3, 1, len=5) ^ 2
    label_y <- dnorm(df$label_x, df$mu, df$sigma)
    

    i.e. line widths and label y positions, and we can start to make the plot:

    # start by setting up plot nicely and setting plot limits
    par(bty='l', mar=c(3, 3, 0.5, 0.5), mgp=c(1.8, 0.4, 0), tck=-0.02)
    plot.new(); plot.window(c(1, 7), c(0, 0.56), yaxs='i')
    
    # loop over data drawing curves
    for (i in 1:nrow(df)) {
        curve(dnorm(x, df$mu[[i]], df$sigma[[i]]), add=T, n=151, lwd=lwds[[i]])
    }
    
    # draw labels
    text(df$label_x, label_y, df$label, pos=df$label_pos)
    
    # draw axes
    axis(1, lwd=0, lwd.ticks=1)
    axis(2, lwd=0, lwd.ticks=1)
    box(lwd=1)
    
    # finally, title and legend
    title(xlab='Level of state', ylab='Probability density')
    legend('topleft', legend=df$title, lwd=lwds, bty='n', cex=0.85)
    

    this gives us something like:

    replicate original plot

    I've also gone with more modern capitalisation, and started the y-axis at zero as these are probabilities so can't be negative

    My preferences would be for something closer to this:

    my version

    the thin lines cover 2 standard deviations (i.e. 95% intervals) around the mean, thick lines 1 SDs (68%), and the point is the mean. it's much easier to discriminate each measure and compare across them, and it doesn't artificially make "extraversion" more prominent. the code for this is similar:

    par(bty='l', mar=c(3, 8, 0.5, 0.5), mgp=c(1.8, 0.4, 0), tck=-0.02)
    plot.new(); plot.window(c(1, 7), c(5.3, 0.7))
    
    # draw quantiles
    for (i in 1:nrow(df)) {
        lines(df$mu[[i]] + df$sigma[[i]] * c(-1, 1), rep(i,2), lwd=3)
        lines(df$mu[[i]] + df$sigma[[i]] * c(-2, 2), rep(i,2), lwd=1)
    }
    # and means
    points(df$mu, 1:5, pch=20)
    
    axis(1, lwd=0, lwd.ticks=1)
    axis(2, at=1:5, labels=df$title, lwd=0, lwd.ticks=1, las=1)
    box()
    
    title(xlab='Level of state')