I have not been able to find a solution to a problem similar to this on StackOverflow. I hope someone can help!
I am using the R environment.
I have data from turtle nests. There are two types of hourly data for each nest. The first is hourly temperature, and the second is the associated hourly development (the amount of "anatomical" embryonic development).
I am calculating a weighted median: the median of temperature, weighted by development.
Here is the script I am using to calculate the weighted median:
weighted.median <- function(x, w, probs=0.5, na.rm=TRUE) {
  x <- as.numeric(as.vector(x))
  w <- as.numeric(as.vector(w))
  # Drop any observation with a missing value or weight
  # (note: this happens regardless of the na.rm argument)
  if(anyNA(x) || anyNA(w)) {
    ok <- !(is.na(x) | is.na(w))
    x <- x[ok]
    w <- w[ok]
  }
  stopifnot(all(w >= 0))
  if(all(w == 0)) stop("All weights are zero", call.=FALSE)
  # Sort by value and build the cumulative weight fraction
  oo <- order(x)
  x <- x[oo]
  w <- w[oo]
  Fx <- cumsum(w)/sum(w)
  # For each requested probability, find the last value at or below it
  # and interpolate linearly towards the next value if needed
  result <- numeric(length(probs))
  for(i in seq_along(result)) {
    p <- probs[i]
    lefties <- which(Fx <= p)
    if(length(lefties) == 0) {
      result[i] <- x[1]
    } else {
      left <- max(lefties)
      result[i] <- x[left]
      if(Fx[left] < p && left < length(x)) {
        right <- left+1
        y <- x[left] + (x[right]-x[left]) * (p-Fx[left])/(Fx[right]- Fx[left])
        if(is.finite(y)) result[i] <- y
      }
    }
  }
  names(result) <- paste0(format(100 * probs, trim = TRUE), "%")
  return(result)
}
So from the function you can see that I need two input vectors, x and w (which will be temperature and development, respectively).
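To make the role of the weights concrete, here is a small worked sketch of the same steps the function performs (sort, cumulative weight fraction, linear interpolation). The three temperatures and weights are made up for illustration and are not from the nest data.

```r
# Toy data (made up): three temperatures, the warmest carrying most of the weight
x <- c(20, 25, 30)
w <- c(1, 1, 8)

# Sort by value and compute the cumulative weight fraction
oo <- order(x)
xs <- x[oo]
Fx <- cumsum(w[oo]) / sum(w)        # 0.1 0.2 1.0

# Locate the 50% mark and interpolate between its neighbours
p     <- 0.5
left  <- max(which(Fx <= p))        # last value at or below the 50% mark
right <- left + 1
wm <- xs[left] + (xs[right] - xs[left]) * (p - Fx[left]) / (Fx[right] - Fx[left])

wm          # 26.875
median(x)   # 25, the unweighted median for comparison
```

Because most of the weight sits on the 30-degree reading, the weighted median is pulled above the plain median of 25.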
The problem I'm having is that I have hourly temperature traces that last anywhere from 5 days to 53 days (i.e., 120 hours to 1272 hours).
I would like to calculate the daily weighted median for all days within a nest (i.e., take the 24 rows of x and w, and calculate the weighted median, then move onto rows 25-48, and so forth.) The output vector would therefore be a list of daily weighted medians with length n/24 (where n is the total number of rows in x).
In other words, I would like to analyse my data automatically, in a fashion equivalent to manually doing this (nest1 is the data sheet for Nest 1, which contains two vectors, temp and devo; devo is the weight):
`weighted.median(nest1$temp[1:24], nest1$devo[1:24], na.rm=TRUE)`
followed by
`weighted.median(nest1$temp[25:48], nest1$devo[25:48], na.rm=TRUE)`
followed by
`weighted.median(nest1$temp[49:72], nest1$devo[49:72], na.rm=TRUE)`
all the way to
`weighted.median(nest1$temp[(n-23):n], nest1$devo[(n-23):n], na.rm=TRUE)`
I'm afraid I don't even know where to start. Any help or clues would be very much appreciated.
The main idea is to create a new column for day 1, day 2, ..., day n/24, split the dataframe into subsets by day, and apply your function to each subset.
First I create some sample data:
set.seed(123)
n <- 120 # number of rows
nest1 <- data.frame(temp = rnorm(n), devo = rpois(n, 5))
Create the splitting variable:
nest1$day <- rep(1:(nrow(nest1)/24), each = 24)
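The `rep()` construction above assumes the number of rows is an exact multiple of 24. If a trace ends part-way through a day, the generated vector would be shorter than the data frame. One possible alternative, sketched here with a hypothetical trace length of 130 hours, is to derive the day from the row number by integer division:

```r
# Hypothetical trace length: 5 full days plus 10 extra hours
n <- 130

# Rows 1-24 map to day 1, rows 25-48 to day 2, and so on;
# the trailing 10 rows become an incomplete day 6
day <- (seq_len(n) - 1) %/% 24 + 1

# Number of hourly rows available for each day
rows_per_day <- table(day)
rows_per_day

# Optionally keep only rows that belong to complete 24-hour days
complete <- day %in% names(rows_per_day)[rows_per_day == 24]
sum(complete)   # 120
```

Whether an incomplete final day should be kept or dropped depends on the analysis; the `complete` vector is just one way to filter it out.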
Then, use the `by()` function to split `nest1` by `nest1$day` and apply the function to each subset:
out <- by(nest1, nest1$day, function(d) {
  weighted.median(d$temp, d$devo, na.rm = TRUE)
})
data.frame(day = dimnames(out)[[1]], x = as.vector(out))
#   day           x
# 1   1 -0.45244433
# 2   2  0.15337312
# 3   3  0.07071673
# 4   4  0.23873174
# 5   5 -0.27694709
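If the same calculation has to be repeated for many nests, the split-and-apply step could be wrapped in a small helper. The sketch below is only one way to do it: `daily_apply` is a hypothetical name, and `stats::weighted.mean` is used purely as a stand-in summary so that the block runs on its own.

```r
# Hypothetical helper (not part of the answer above): apply a two-argument
# summary f(x, w) to consecutive blocks of `block` rows
daily_apply <- function(x, w, f, block = 24) {
  day <- (seq_along(x) - 1) %/% block + 1     # block index for every row
  idx <- split(seq_along(x), day)             # row numbers grouped by day
  vapply(idx, function(i) f(x[i], w[i]), numeric(1))
}

# Simulated two-day trace, generated the same way as the sample data
set.seed(123)
temp <- rnorm(48)
devo <- rpois(48, 5)

# weighted.mean is only a stand-in; any function of (x, w) returning
# a single number can be passed instead
res <- daily_apply(temp, devo, weighted.mean)
res
```

With the function from the question, the call would presumably look like `daily_apply(nest1$temp, nest1$devo, function(x, w) unname(weighted.median(x, w)))`, giving a numeric vector named by day.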
Instead of using `by`, you can also use the `group_by` + `summarise` functions from the `dplyr` package:
library(dplyr)
nest1 %>%
  group_by(day) %>%
  summarise(x = weighted.median(temp, devo, na.rm = TRUE))
# # A tibble: 5 x 2
#     day       x
#   <int>   <dbl>
# 1     1 -0.452
# 2     2  0.153
# 3     3  0.0707
# 4     4  0.239
# 5     5 -0.277