I'd like to write some conditional stats in my graph if the data is bigger than a certain value.
With the kind help of Jack Ryan (Cut data and access groups to draw percentile lines), I could create the following script that groups data into hours and plots the result:
# Read example data
A <- read.csv(url('http://people.ee.ethz.ch/~hoferr/download/data-20130812.csv'))
# Libraries
library(doBy)
library(ggplot2)
library(plyr)
library(reshape2)
library(MASS)
library(scales)
# Sample size function
give.n <- function(x){
return(c(y = min(x) - 0.2, label = length(x)))
}
# Calculate gaps
gaps <- rep(NA, length(A$Timestamp))
times <- A$Timestamp
loss <- A$pingLoss
gap.start <- 1
gap.end <- 1
for(i in 2:length(A$Timestamp))
{ #For all rows
if(is.na(A$pingRTT.ms.[i]))
{ #Currently no connection
if(!is.na(A$pingRTT.ms.[i-1]))
{ #Connection lost now
gap.start <- i
}
if(!is.na(A$pingRTT.ms.[i+1]))
{ # Connection restores next time
gap.end <- i+1
gaps[gap.start] <- as.numeric(A$Timestamp[gap.end]-A$Timestamp[gap.start], units="secs")
loss[gap.start] <- gap.end - gap.start
}
}
}
H <- data.frame(times, gaps, loss)
H <- H[complete.cases(H),]
C <- H
C$dates <- strptime(C$times, "%Y-%m-%d %H:%M:%S")
C$h1 <- C$dates$hour
# Calculate percentiles
cuts <- c(1, .75, .5, .25, 0)
c <- ddply(C, .(h1), function (x) { summarise(x, y = quantile(x$gaps, cuts)) } )
c$cuts <- cuts
c <- dcast(c, h1 ~ cuts, value.var = "y")
c.melt <- melt(c, id.vars = "h1")
p <- ggplot(c.h1.melt, aes(x = h1, y = value, color = variable)) +
geom_point(size = 4) +
stat_summary(fun.data = max.n, geom = "text", fun.y = max, colour = "red", angle = 90, size=4) +
scale_colour_brewer(palette="RdYlBu", name="Percentile", guide = guide_legend(reverse=TRUE)) +
scale_x_continuous(breaks=0:23, limits = c(0,23)) +
annotation_logticks(sides = "lr") +
theme_bw() +
scale_y_log10(breaks=c(1e0,1e1,1e2,1e3,1e4), labels = trans_format("log10", math_format(10^.x)), limits=c(1e0,1e4)) +
xlab("Hour of day") + ylab("Ping gaps [s]")
p
p <- ggplot(c.m1.melt, aes(x = m1/60, y = value, color = variable)) +
geom_point(size = 1) +
stat_summary(fun.data = give.n, geom = "text", fun.y = median, angle = 90, size=4) +
stat_summary(fun.data = max.n, geom = "text", fun.y = max, colour = "red", angle = 90, size=4) +
scale_colour_brewer(palette="RdYlBu", name="Percentile", guide = guide_legend(reverse=TRUE)) +
scale_x_continuous(breaks=0:23, limits = c(0,24)) +
annotation_logticks(sides = "lr") +
theme_bw() +
scale_y_log10(breaks=c(1e0,1e1,1e2,1e3,1e4), labels = trans_format("log10", math_format(10^.x)), limits=c(1e0,1e4)) +
xlab("Time of day") + ylab("Ping gaps [s]")
p
This creates an hourly grouped plot of gaps with the length of the longest gaps written right next to the data points:
Below is the minutely grouped plot. The number are unreadable why I'd like to add conditional stats if the gap is longer than 5 minutes or only for the ten longest gaps or something like this.
I tried to just change the stat function to
max.n.filt <- function(x){
filter = 300
if ( x > filter ) {
return(c(y = max(x) + 0.4, label = round(max(10^x),2)))
} else {
return(c(y=x, label = ""))
}
}
and use this for the minutely grouped plot. But I got this error:
Error in list_to_dataframe(res, attr(.data, "split_labels")) :
Results do not have equal lengths
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Error in if (nrow(layer_data) == 0) return() : argument is of length zero
Calls: print ... print.ggplot -> ggplot_gtable -> Map -> mapply -> <Anonymous>
In addition: Warning message:
Removed 6 rows containing missing values (geom_point).
In addition, in the hourly plot, I'd like to write the number of samples per hour right next to the length of the gaps. I think I can add a new column to the c data frame, but unfortunately I can't find a way to do this.
Any help is very much appreciated.
See ?stat_summary.
fun.data : Complete summary function. Should take data frame as input and return data frame as output
Your function max.n.filt
uses an if()
statement that tries to evaluate the condition x > filter
. But when length(x) > 1
, the if()
statement only evaluates the condition for the first value of x
. When used on a data frame, this will return a list cobbled together from the original input x
and whatever label the if()
statement returns.
> max.n.filt(data.frame(x=c(10,15,400)))
$y.x
[1] 10 15 400
$label
[1] ""
Try a function that uses ifelse()
instead:
max.n.filt2 <- function(x){
filter = 300 # whatever threshold
y = ifelse( x > filter, max(x) + 1, x[,1] )
label = ifelse( x > filter, round(max(x),2), NA )
return(data.frame(y=y[,1], label=label[,1]))
}
> max.n.filt2(data.frame(x=c(10,15,400)))
y label
1 10 NA
2 15 NA
3 401 400
Alternatively, you might just find it easier to use geom_text()
. I can't reproduce your example, but here's a simulated dataset:
set.seed(101)
sim_data <- expand.grid(m1=1:1440, variable=factor(c(0,0.25,0.5,0.75,1)))
sim_data$sample_size <- sapply(1:1440, function(.) sample(1:25, 1, replace=T))
sim_data$value = t(sapply(1:1440, function(.) quantile(rgamma(sim_data$sample_size, 0.9, 0.5),c(0,0.25,0.5,0.75,1))))[1:(1440*5)]
Just use the subset
argument in geom_text()
to select those points you wish to label:
ggplot(sim_data, aes(x = m1/60, y = value, color = variable)) +
geom_point(size = 4) + geom_text(aes(label=round(value)), subset = .(variable == 1 & value > 25), angle = 90, size = 4, colour = "red", hjust = -0.5)
If you have a column of sample sizes, those can be incorporated into label
with paste()
:
ggplot(sim_data, aes(x = m1/60, y = value, color = variable)) +
geom_point(size = 4) + geom_text(aes(label=paste(round(value),", N=",sample_size)), subset = .(variable == 1 & value > 25), angle = 90, size = 4, colour = "red", hjust = -0.25)
(or create a separate column in your data with whatever labels you want.) If you're asking about how to retrieve the sample sizes, you could modify your call to ddply()
like this:
...
c2 <- ddply(C, .(h1), function (x) { cbind(summarise(x, y = quantile(x$gaps, cuts)), n=nrow(x)) } )
c2$cuts <- cuts
c2 <- dcast(c2, h1 + n ~ cuts, value.var = "y")
c2.h1.melt <- melt(c2, id.vars = c("h1","n"))
...