Search code examples
rggplot2histogramkernel-densitydensity-plot

Averaging geom_density(y=..count..) over a grouping variable


I plot some distributions using:

geom_density(aes(my.variable,
color=my.factor,
group=my.replicates,
y=..count..))

I want to plot the average lines over replicates (one line for each levels of my.factor), considering that I don't have the same number of replicates within each level of my.factor --> I can't just remove the 'group' argument, since ..count.. depends on the number of replicates. I would like therefore something like ..count../number of replicates

Here is the context and a reproducible exemple

I sampled in 2 habitats (a and b): fish number and body size of each individual. I had a different sampling effort between habitats. (ra and rb are the number of replicates sampled within the habitats a and b, respectivelly) I am interested in average differences between habitats in term of both fish abundance and body size. However, I don't know how to deal with the fact that I don't have the same number of replicat.

DATA

#number of replicat
ra=4;rb=6
#number of individuals (lambda of poisson distribution)
na=30;nb=60
#size of individuals (lambda of poisson distribution)
sa=90;sb=80

#data for habitat a
dfa=data.frame()
for (ri in 1:ra){
  habitat="a"
  nb_rep=ra
  replicat=paste("r",ri,sep="")
  size=rpois(rpois(1,na),sa)
  dfa=rbind.data.frame(dfa,data.frame(habitat,nb_rep,replicat,size))
}
#data for habitat b
dfb=data.frame()
for (ri in 1:rb){
  habitat="b"
  nb_rep=rb
  replicat=paste("r",ri,sep="")
  size=rpois(rpois(1,nb),sb)
  dfb=rbind.data.frame(dfb,data.frame(habitat,nb_rep,replicat,size))
}
#whole data set
df=rbind(dfa,dfb)

PLOTS

require(ggplot2)
summary(df)

density

ggplot(df,aes(size,color=habitat))+
geom_density(aes(y=..density..))

count

ggplot(df,aes(size,color=habitat))+
geom_density(aes(y=..count..))

But this is BIASED if habitats have not been sampled with the same effort i.e. different number of replicates

count, considering the different replicates

ggplot(df,aes(size,color=habitat,group=paste(habitat,replicat)))+
geom_density(aes(y=..count..))

From this last plot, how to get the average lines over replicates ? Thanks


Solution

  • I don't think you can do this within ggplot. You can calculate the density yourself and then plot the calculated density. Below I show that it actually works, by reproducing the plot you already have with ggplot(df,aes(size,color=habitat)) + geom_density(aes(y=..count..)).

    require(plyr)
    # calculate the density
    res <- dlply(df, .(habitat), function(x) density(x$size))
    dd <- ldply(res, function(z){
      data.frame(size = z[["x"]], 
                 count = z[["y"]]*z[["n"]])
    })
    # these two plots are essentially the same. 
    ggplot(dd, aes(size, count, color=habitat)) + 
      geom_line()
    ggplot(df,aes(size,color=habitat))+
      geom_density(aes(y=..count..))
    

    Now for the slightly more difficult task of averaging the densities of the different replicates.

    # calculate the density 
    res <- dlply(df, .(habitat), function(dat){
      lst <- dlply(dat, .(replicat), function(x) density(x$size, 
                                                         # specify to and from based on dat, not x. 
                                                         from=min(dat$size), 
                                                         to=max(dat$size)
      ))
      data.frame(size=lst[[1]][["x"]], 
                 #count=colMeans(laply(lst, function(a) a[["y"]]), na.rm=TRUE)*nrow(dat),
                 count=colMeans(laply(lst, function(a) a[["y"]]), na.rm=TRUE)*nrow(dat)/nlevels(droplevels(dat$replicat)), 
    
                 habitat=dat$habitat[1])
    })
    dd <- rbindlist(res)
    ggplot(dd, aes(size, count, color=habitat)) + 
      geom_line()