I am looking for a way to plot the mean of one variable across bins of the log2 of another variable (which has both positive and negative values), using the more advanced functions in ggplot2. I suspect I am overcomplicating this and that ggplot2 already has an option for it, but I cannot get it right, so before going back to basics I thought I would try to learn how to apply these functions here.
library(ggplot2)
value <- rnorm(1000, 0, 20)
dist = c(rep(0, 15), sample(1:490), sample(-1:-495))
data = data.frame(value=value, dist=dist)
data$log=log2(abs(data$dist)+1)
# re-label the x-axis (back-transform the log2 values):
data$Labels=2^(abs(data$log))-1
data$bins=cut(data$log, breaks=10)
# Try to recover the negative log after transformation
data$sign=ifelse(data$dist==0, 0, ifelse(data$dist>0, "+", "-"))
# find the average expression of value per each bin
data=with(data, aggregate(data$value, by = list(bins, sign), FUN = function(x) c(mn =mean(x), n=length(x) )))
data= as.data.frame(as.list(data))
names(data)=c("bins", "sign", "mean", "length")
# I am doing this in a very contorted way to try to achieve what I would like which is something like this:
bin_num = do.call("rbind", lapply(strsplit(sapply(as.character(data$bins), function(x) substr(x, 2, nchar(x)-1)), ","), as.numeric))
data$bin_num=bin_num[,1]
data$bin_num=ifelse(data$sign==0, 0, ifelse(data$sign=="-", -data$bin_num, data$bin_num))
data = data[order(data$bin_num),]
data <- transform(data, x2 = factor(paste(sign, bins)))
data <- transform(data, x2 = reorder(x2, rank(bin_num)))
# Line plot to show the distribution of the means across the bins of log2 of x:
ggplot(data, aes(y = mean, x = bin_num, group=1)) + geom_point() + geom_line()
# Then I am trying to re-label the logarithmic transformations here by adding labels, but of course it is not working:
ggplot(data, aes(y = mean, x = bin_num, group=1)) + geom_point() + geom_line() + scale_x_discrete(labels=data$dist, breaks=data$bin_num)
I see that ggplot2 can compute the mean directly, so I may not even need the previous commands. I tried:
ggplot(data, aes(x = bins, y = mean)) + stat_summary(fun.y = "mean") + geom_line() + scale_x_continuous(breaks = labels)
But of course it doesn't work... I also saw that ggplot2 has functions that handle logarithmic labelling automatically, instead of what I did here, but I don't see how to use them when the values to be logged include negatives. There is a very nice function from another question here that converts between the two scales, but I don't see how it is useful at this stage. Thanks very much for any suggestions on how to go about this, really appreciated!
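For reference, the sign-preserving log2 transform described above can be sketched in base R as follows. The function names `signed_log2` and `signed_log2_inv` and the integer bin edges are illustrative assumptions, not taken from the linked question; the point is only that keeping the sign inside the transform avoids the separate `sign` column and the string parsing of the `cut()` labels.

```r
# Hypothetical helpers: log2(|x| + 1) with the sign of x restored,
# and the matching inverse for relabelling the axis.
signed_log2 <- function(x) sign(x) * log2(abs(x) + 1)
signed_log2_inv <- function(y) sign(y) * (2^abs(y) - 1)

set.seed(123)
dat <- data.frame(
  value = rnorm(1000, 0, 20),
  dist  = c(rep(0, 15), sample(1:490), sample(-1:-495))
)

# Transform once, then bin on the signed scale.
dat$slog <- signed_log2(dat$dist)

# Assumption: unit-wide bins from -9 to 9 cover log2(496) on both sides.
bin_edges <- seq(-9, 9, by = 1)
dat$bin <- cut(dat$slog, breaks = bin_edges, include.lowest = TRUE)

# Mean of `value` per signed bin, already in left-to-right order.
bin_means <- aggregate(value ~ bin, data = dat, FUN = mean)
head(bin_means)
```

Because the bins are an ordered factor on the signed scale, they can be plotted directly, and `signed_log2_inv(bin_edges)` gives the corresponding values of `dist` for axis labels.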
First version of an answer, using data.table for speed and readability. The code below reproduces the question's result with shorter and faster code:
library(data.table)
library(ggplot2)
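# function that returns the lower bound of the cut() interval each x falls in
lower.bound <- function(x, n) {
  # interval labels look like "(a,b]"; avoid shadowing base::c()
  bins <- as.character(cut(x, n))
  # keep the text between the opening bracket and the comma
  tmp <- substr(x = bins, start = 2, stop = regexpr(",", bins) - 1)
  return(as.numeric(tmp))
}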
nbin <- 10
set.seed(123)
dat <- data.table(value = rnorm(1000, 0, 20),
                  dist = c(rep(0, 15), sample(1:490), sample(-1:-495)))
dat[, log := log2(abs(dist) + 1)]
# back-transform: the inverse of log2(abs(dist) + 1) is 2^log - 1
dat[, labels := 2^(abs(log)) - 1]
dat[, sign := ifelse(dist == 0,
                     0,
                     ifelse(dist > 0, "+", "-"))]
# signed lower bound of the log2 bin; zero distances get their own bin
dat[, bin := ifelse(sign == 0,
                    0,
                    ifelse(sign == "+",
                           lower.bound(log, nbin),
                           -lower.bound(log, nbin)))]
# per bin and sign: mean value, count, and mean distance (used as x position)
sumdat <- dat[, .(mvalue = mean(value),
                  nvalue = .N,
                  ylab = mean(dist)),
              by = .(bin, sign)][order(bin)]
ggplot(sumdat, aes(x = ylab, y = mvalue)) + geom_line()
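As a possible alternative, here is a minimal sketch that lets ggplot2 do both the binning and the averaging, which is what the question was originally after. It assumes ggplot2 3.3.0 or later (for the `fun` argument of `stat_summary_bin()`) and the scales package for `pseudo_log_trans()`. Note that this transform is asinh-based, so it only approximates `log2(abs(x) + 1)` with the sign restored, and the tick positions in `x_breaks` are an arbitrary choice made for illustration.

```r
library(ggplot2)

set.seed(123)
dat <- data.frame(
  value = rnorm(1000, 0, 20),
  dist  = c(rep(0, 15), sample(1:490), sample(-1:-495))
)

# Assumption: illustrative tick positions, powers of two on both sides of zero.
x_breaks <- c(-256, -64, -16, -4, 0, 4, 16, 64, 256)

p <- ggplot(dat, aes(x = dist, y = value)) +
  # Scale transformations are applied before stats, so the 20 bins are
  # equally wide on the transformed (log-like) axis, not on the raw axis.
  stat_summary_bin(fun = mean, bins = 20, geom = "line") +
  stat_summary_bin(fun = mean, bins = 20, geom = "point") +
  # pseudo_log_trans() is close to a signed log2 for large |x| and roughly
  # linear near zero. ggplot2 >= 3.5.0 names this argument `transform`.
  scale_x_continuous(trans = scales::pseudo_log_trans(base = 2),
                     breaks = x_breaks)

print(p)
```

The axis labels stay in the original units of `dist`, so no manual back-transformation or relabelling is needed; the trade-off is that the bin edges are chosen by ggplot2 rather than by `cut()`.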