
Trying to make a script calculate a value (using a function) for every 24 rows


I have not been able to find a solution to a problem similar to this on StackOverflow. I hope someone can help!

I am using the R environment.

I have data from turtle nests. There are two types of hourly data for each nest: hourly Temperature, and the associated hourly Development (the amount of "anatomical" embryonic development).

I am calculating a weighted median: the median of temperature, weighted by development.

Here is the function I am using to calculate the weighted median:

weighted.median <- function(x, w, probs=0.5, na.rm=TRUE) {
  x <- as.numeric(as.vector(x))
  w <- as.numeric(as.vector(w))
  if(anyNA(x) || anyNA(w)) {
    ok <- !(is.na(x) | is.na(w))
    x <- x[ok]
    w <- w[ok]
  }
  stopifnot(all(w >= 0))
  if(all(w == 0)) stop("All weights are zero", call.=FALSE)
  #'
  oo <- order(x)
  x <- x[oo]
  w <- w[oo]
  Fx <- cumsum(w)/sum(w)
  #'
  result <- numeric(length(probs))
  for(i in seq_along(result)) {
    p <- probs[i]
    lefties <- which(Fx <= p)
    if(length(lefties) == 0) {
      result[i] <- x[1]
    } else {
      left <- max(lefties)
      result[i] <- x[left]
      if(Fx[left] < p && left < length(x)) {
        right <- left+1
        y <- x[left] + (x[right]-x[left]) * (p-Fx[left])/(Fx[right]-Fx[left])
        if(is.finite(y)) result[i] <- y
      }
    }
  }
  names(result) <- paste0(format(100 * probs, trim = TRUE), "%")
  return(result)
}

So from the function you can see that I need two input vectors, x and w (which will be temperature and development, respectively).
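
To make the inputs concrete, the core steps of the function (sort by x, accumulate the normalised weights, pick the value at the 0.5 mark) can be traced on a toy example. This is only a sketch with made-up numbers, not real nest data, and it skips the NA handling and interpolation branches.

```r
# Toy data (hypothetical values, not from any nest)
x <- c(30, 10, 20)   # temperatures
w <- c(2, 1, 1)      # development weights

oo <- order(x)                 # ordering that sorts the temperatures
xs <- x[oo]                    # sorted temperatures: 10 20 30
Fx <- cumsum(w[oo]) / sum(w)   # cumulative weight share: 0.25 0.50 1.00

# Largest sorted value whose cumulative weight share does not exceed 0.5
left <- max(which(Fx <= 0.5))
xs[left]
```

Here the cumulative weight share reaches exactly 0.5 at the second sorted value, so the interpolation branch would not be triggered and the weighted median is 20.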

The problem I'm having is that I have hourly temperature traces that last anywhere from 5 days to 53 days (i.e., 120 hours to 1272 hours).

I would like to calculate the daily weighted median for all days within a nest (i.e., take the first 24 rows of x and w and calculate the weighted median, then move on to rows 25-48, and so forth). The output would therefore be a vector of daily weighted medians with length n/24 (where n is the total number of rows in x).

In other words, I would like to analyse my data automatically, in a fashion equivalent to manually doing this (nest1 is the data sheet for Nest 1, which contains two vectors, temp and devo, where devo is the weight):

`weighted.median(nest1$temp[1:24], nest1$devo[1:24], na.rm=TRUE)`

followed by

`weighted.median(nest1$temp[25:48], nest1$devo[25:48], na.rm=TRUE)`

followed by

`weighted.median(nest1$temp[49:72], nest1$devo[49:72], na.rm=TRUE)`

all the way to

`weighted.median(nest1$temp[(n-23):n], nest1$devo[(n-23):n], na.rm=TRUE)`

I'm afraid I don't even know where to start. Any help or clues would be very much appreciated.


Solution

  • The main idea is to create a new column that labels each row with its day (day 1, day 2, ..., day n/24), split the data frame into subsets by day, and apply your function to each subset.

    First I create some sample data:

    set.seed(123)
    n <- 120 # number of rows
    nest1 <- data.frame(temp = rnorm(n), devo = rpois(n, 5))
    

    Create the splitting variable (this assumes the number of rows is an exact multiple of 24):

    nest1$day <- rep(1:(nrow(nest1)/24), each = 24)
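
    If a trace does not contain a whole number of days, the rep() call above produces a vector of the wrong length. One possible workaround, sketched here on hypothetical data (the data frame `nest` and its values are made up for illustration), is to derive the day from the row number by integer division:

    ```r
    # Hypothetical example: 50 hourly rows, i.e. two full days plus 2 extra hours
    nest <- data.frame(temp = seq(20, 30, length.out = 50), devo = rep(1, 50))

    # Integer-divide the zero-based row index by 24 to get a 1-based day number
    nest$day <- (seq_len(nrow(nest)) - 1) %/% 24 + 1

    # Number of hourly rows that fall in each day
    table(nest$day)
    ```

    The last day then simply contains fewer than 24 rows; whether to keep or drop such partial days is a judgment call for the analysis.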
    

    Then, use the by() function to split nest1 by nest1$day and apply the function to each subset:

    out <- by(nest1, nest1$day, function(d) {
      weighted.median(d$temp, d$devo, na.rm = TRUE)
    })
    data.frame(day = dimnames(out)[[1]], x = as.vector(out))
    #   day           x
    # 1   1 -0.45244433
    # 2   2  0.15337312
    # 3   3  0.07071673
    # 4   4  0.23873174
    # 5   5 -0.27694709
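
    As a sanity check, the grouped result for a given day should match the manual call on the corresponding 24 rows. The sketch below illustrates this with split() and vapply() on made-up data; to stay self-contained it uses base R's weighted.mean() as a stand-in summary (the wrapper `f` is hypothetical), and the weighted.median function from the question can be substituted for it.

    ```r
    # Hypothetical data: 48 hourly rows = 2 days
    set.seed(1)
    nest <- data.frame(temp = rnorm(48, mean = 28), devo = runif(48))
    nest$day <- rep(1:2, each = 24)

    # Stand-in summary function; swap in weighted.median from the question
    f <- function(x, w) weighted.mean(x, w)

    # One value per day, named by day
    daily <- vapply(split(nest, nest$day),
                    function(d) f(d$temp, d$devo),
                    numeric(1))

    # Compare day 2 with the manual call on rows 25 to 48
    all.equal(unname(daily["2"]), f(nest$temp[25:48], nest$devo[25:48]))
    ```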
    

    Instead of using by, you can also use the group_by + summarise functions from the dplyr package:

    library(dplyr)
    nest1 %>%
      group_by(day) %>%
      summarise(x = weighted.median(temp, devo, na.rm = TRUE))
    # # A tibble: 5 x 2
    #     day       x
    #   <int>   <dbl>
    # 1     1 -0.452 
    # 2     2  0.153 
    # 3     3  0.0707
    # 4     4  0.239 
    # 5     5 -0.277
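
    Since the question mentions several nests with traces of different lengths, the same idea can be extended by grouping on nest and day together. The following is a base R sketch under assumed names: the combined data frame `nests`, its `nest` column, and the stand-in summary `f` (weighted.mean() here, so the example runs on its own) are all hypothetical.

    ```r
    # Hypothetical data: two nests with different trace lengths (48 h and 72 h)
    set.seed(42)
    nests <- data.frame(
      nest = rep(c("nest1", "nest2"), times = c(48, 72)),
      temp = rnorm(120, mean = 28),
      devo = runif(120)
    )

    # Hour counter that restarts within each nest, converted to a day number
    hour <- ave(seq_len(nrow(nests)), nests$nest, FUN = seq_along)
    nests$day <- (hour - 1) %/% 24 + 1

    # Stand-in summary function; swap in weighted.median from the question
    f <- function(x, w) weighted.mean(x, w)

    # Split by nest and day, dropping combinations that do not occur
    groups <- split(nests, list(nests$nest, nests$day), drop = TRUE)
    res <- vapply(groups, function(d) f(d$temp, d$devo), numeric(1))
    res
    ```

    The result is a named vector with one entry per nest and day (for example "nest2.3" for the third day of the second nest).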