
Generating a histogram and density plot from binned data


I've binned some data and currently have a data frame with two columns, one specifying the bin range and the other the frequency, like this:

> head(data)
      binRange Frequency
1    (0,0.025]        88
2 (0.025,0.05]        72
3 (0.05,0.075]        92
4  (0.075,0.1]        38
5  (0.1,0.125]        20
6 (0.125,0.15]        16

I want to plot a histogram and a density plot from this, but I can't find a way of doing so without generating new bins. Following a solution I found, I tried this:

p <- ggplot(data, aes(x= binRange, y=Frequency)) + geom_histogram(stat="identity")

but it crashes. Does anyone know how to deal with this?

Thank you


Solution

  • The problem is that ggplot doesn't understand the data in the form you supplied it. You need to reshape it, for example like this (I am not a regex master, so there are surely better ways to do it):

    df <- read.table(header = TRUE, text = "
                     binRange Frequency
    1    (0,0.025]        88
    2 (0.025,0.05]        72
    3 (0.05,0.075]        92
    4  (0.075,0.1]        38
    5  (0.1,0.125]        20
    6 (0.125,0.15]        16")
    
    library(stringr)
    library(splitstackshape)
    library(ggplot2)
    # extract the numbers, e.g. "(0,0.025]" becomes "0,0.025"
    df$binRange <- str_extract(df$binRange, "[0-9].*[0-9]+")
    
    # split on the comma into two columns:
    # binRange_1 (start point) and binRange_2 (end point)
    df <- cSplit(df, "binRange")
    
    # plot it; centring each bar on the bin midpoint makes it span its bin.
    # geom_bar() has no breaks argument, so axis breaks go on the x scale.
    ggplot(df, aes(x = (binRange_1 + binRange_2) / 2, y = Frequency)) +
        geom_bar(stat = "identity", width = 0.025) +
        scale_x_continuous(breaks = seq(0, 0.15, by = 0.025))
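
    Since the answer notes that there may be better ways to parse the labels, here is a minimal sketch that uses only base R instead of stringr and splitstackshape. The column names `lower`, `upper` and `mid` are made up for illustration, and the sketch assumes every label has the form `(start,end]`:

    ```r
    # Same six bins as in the question, built without read.table()
    df <- data.frame(
      binRange  = c("(0,0.025]", "(0.025,0.05]", "(0.05,0.075]",
                    "(0.075,0.1]", "(0.1,0.125]", "(0.125,0.15]"),
      Frequency = c(88, 72, 92, 38, 20, 16),
      stringsAsFactors = FALSE
    )

    # Strip the leading "(" and the trailing "]", then split on the comma
    edges <- strsplit(gsub("^\\(|\\]$", "", df$binRange), ",", fixed = TRUE)

    # 'lower', 'upper' and 'mid' are hypothetical column names
    df$lower <- as.numeric(sapply(edges, `[`, 1))
    df$upper <- as.numeric(sapply(edges, `[`, 2))
    df$mid   <- (df$lower + df$upper) / 2
    ```

    `df$mid` could then be mapped to `x` in the `ggplot()` call, with `df$upper - df$lower` giving the bar width if the bins are not all equally wide.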
    

    Or, if you don't want the bins to be interpreted numerically, you can simply plot them as categories:

    df <- read.table(header = TRUE, text = "
                     binRange Frequency
    1    (0,0.025]        88
    2 (0.025,0.05]        72
    3 (0.05,0.075]        92
    4  (0.075,0.1]        38
    5  (0.1,0.125]        20
    6 (0.125,0.15]        16")
    
    library(ggplot2)
    ggplot(df, aes(x = binRange, y = Frequency)) + geom_bar(stat = "identity")
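
    One caveat with the categorical version: ggplot2 orders a character column alphabetically on a discrete axis, which may not match the numeric order of the bins once there are more of them. A small sketch, assuming the rows are already in the desired order, fixes the order by turning the labels into a factor:

    ```r
    # Same six bins as in the question
    df <- data.frame(
      binRange  = c("(0,0.025]", "(0.025,0.05]", "(0.05,0.075]",
                    "(0.075,0.1]", "(0.1,0.125]", "(0.125,0.15]"),
      Frequency = c(88, 72, 92, 38, 20, 16),
      stringsAsFactors = FALSE
    )

    # Use the order of appearance as the factor level order,
    # so the x axis follows the rows rather than alphabetical sorting
    df$binRange <- factor(df$binRange, levels = unique(df$binRange))
    ```

    The `ggplot()` call itself stays the same; only the column type changes.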
    

    You won't be able to draw a true density plot from this data, because it is no longer continuous but categorical (binned). That is why I actually prefer the second way of showing it.
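
    If a density-like scale is still wanted, one rough approximation, which goes beyond the answer above, is to rescale the counts so that the bar areas sum to 1. This is a sketch only: it assumes equal bin widths of 0.025, uses just the six rows shown in the question (the real total should cover all bins), and the names `lower`, `binWidth` and `density` are made up for illustration:

    ```r
    # Bin start points and counts for the six bins shown in the question
    df <- data.frame(
      lower     = seq(0, 0.125, by = 0.025),
      Frequency = c(88, 72, 92, 38, 20, 16)
    )

    binWidth <- 0.025  # assumed constant across bins

    # Density height = count / (total count * bin width),
    # so that height * width summed over all bins equals 1
    df$density <- df$Frequency / (sum(df$Frequency) * binWidth)

    # Sanity check: the total bar area
    totalArea <- sum(df$density * binWidth)
    ```

    Plotting `density` instead of `Frequency` gives a histogram on a density scale; it is still a step function and not a smooth kernel density estimate, which would need the raw observations.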