Search code examples
rrowconditional-statementsmeanmoving-average

In R, average row value until hit a specific condition, then restart, with output in new column


I am working with GPS data and trying to figure out how to average the 11th-15th fixes for latitude and longitude. I have seen solutions in similar questions for how to average every n rows. The problem is that occasionally the satellites bomb out and the fixes stop at 13 or 14. So, in these cases, I only want to average 3 or 4 values instead of 5. So I am looking to average values for latitude and longitude starting from where the number in series is 11 until the number in series drops again (or as long as it is increasing? I need it to include the last set, which will not drop back down to a low number again). I have started by deleting all rows where the number in series is NOT in my desired 11-15 ranges. So,for an example dummy data set, this leaves me with:

      Date      Time     Long       Lat     NoInSeries
12  17/11/2014 22:09:17 115.9508 -31.82850    11
13  17/11/2014 22:09:18 115.9508 -31.82846    12
14  17/11/2014 22:09:19 115.9513 -31.82864    13
15  17/11/2014 22:09:21 115.9511 -31.82863    14
26  18/11/2014 00:07:14 115.9509 -31.82829    11
27  18/11/2014 00:07:15 115.9509 -31.82829    12
28  18/11/2014 00:07:16 115.9509 -31.82830    13
29  18/11/2014 00:07:17 115.9509 -31.82830    14
30  18/11/2014 00:07:18 115.9509 -31.82831    15
56  18/11/2014 10:00:24 115.9513 -31.82670    11
57  18/11/2014 10:00:25 115.9514 -31.82670    12
58  18/11/2014 10:00:26 115.9514 -31.82669    13
59  18/11/2014 10:00:27 115.9514 -31.82668    14
60  18/11/2014 10:00:28 115.9514 -31.82668    15

My desired output would be something like this, with the first one averaging 4 (11-14) and the next two averaging 5 (11-15):

     Date      Time     Long       Lat     NoInSeries  AvgLong     Avg Lat
12  17/11/2014 22:09:17 115.9508 -31.82850    11       115.9510   -31.82856
13  17/11/2014 22:09:18 115.9508 -31.82846    12          NA          NA
14  17/11/2014 22:09:19 115.9513 -31.82864    13          NA          NA
15  17/11/2014 22:09:21 115.9511 -31.82863    14          NA          NA
26  18/11/2014 00:07:14 115.9509 -31.82829    11       115.9509   -31.82830
27  18/11/2014 00:07:15 115.9509 -31.82829    12          NA          NA
28  18/11/2014 00:07:16 115.9509 -31.82830    13          NA          NA
29  18/11/2014 00:07:17 115.9509 -31.82830    14          NA          NA
30  18/11/2014 00:07:18 115.9509 -31.82831    15          NA          NA
56  18/11/2014 10:00:24 115.9513 -31.82670    11       115.9514   -31.82669
57  18/11/2014 10:00:25 115.9514 -31.82670    12          NA          NA
58  18/11/2014 10:00:26 115.9514 -31.82669    13          NA          NA
59  18/11/2014 10:00:27 115.9514 -31.82668    14          NA          NA
60  18/11/2014 10:00:28 115.9514 -31.82668    15          NA          NA

I would then go through and delete all rows where AvgLong==NA, so my final output would just have all the rows where number in series=11 with the averages.

I really don't know where to start with the code for this...the examples I have found have all discussed averaging an exact number of rows, rather than a variable number.

For instance:

c( tapply( x, (row(x)-1)%/%5, mean ) )

Or:

idx <- ceiling(seq_len(nrow(dd)) / 5)
# do colMeans on all columns except last one.
res <- lapply(split(dd[-(ncol(dd))], idx), colMeans, na.rm = TRUE)
# assign first value of "datetime" in each 5-er group as names to list
names(res) <- dd$datetime[seq(1, nrow(df), by=5)]
# bind them to give a matrix
res <- do.call(rbind, res)

Also, the answers I have seen generally then output the averages as a new data frame... Ultimately, I also want to have this averaging under a condition: if schedule is 'Multifix', I want to average 11 to however high it goes up to 15, whereas if schedule is 'Continuous', I want to average from 181 up until however high each one goes...). Something like this:

if(import.list$Schedule=='Multifix'){
...code to average Long and Lat for Number in Series from 11 up to however high it goes (up to 15)...
} else {
...code to average Long and Lat for Number in Series from 241 up to however high it goes...
}

Or perhaps I have an if else statement to define a variable and then use that variable in the function to do the averaging?

...but I imagine this condition could complicate things if the output creates a new dataframe, which is why I was aiming for just adding values to new columns "AvgLong" and "AvgLat." Thanks for any help!!


Solution

  • #dput function shows the data I was working from your question.

    dput(df1)
    structure(list(ID = c(12L, 13L, 14L, 15L, 26L, 27L, 28L, 29L, 
    30L, 56L, 57L, 58L, 59L, 60L), Date = c("17/11/2014", "17/11/2014", 
    "17/11/2014", "17/11/2014", "18/11/2014", "18/11/2014", "18/11/2014", 
    "18/11/2014", "18/11/2014", "18/11/2014", "18/11/2014", "18/11/2014", 
    "18/11/2014", "18/11/2014"), Time = c("22:09:17", "22:09:18", 
    "22:09:19", "22:09:21", "00:07:14", "00:07:15", "00:07:16", "00:07:17", 
    "00:07:18", "10:00:24", "10:00:25", "10:00:26", "10:00:27", "10:00:28"
    ), Long = c(115.9508, 115.9508, 115.9513, 115.9511, 115.9509, 
    115.9509, 115.9509, 115.9509, 115.9509, 115.9513, 115.9514, 115.9514, 
    115.9514, 115.9514), Lat = c(-31.8285, -31.82846, -31.82864, 
    -31.82863, -31.82829, -31.82829, -31.8283, -31.8283, -31.82831, 
    -31.8267, -31.8267, -31.82669, -31.82668, -31.82668), NoInSeries = c(11L, 
    12L, 13L, 14L, 11L, 12L, 13L, 14L, 15L, 11L, 12L, 13L, 14L, 15L
    )), .Names = c("ID", "Date", "Time", "Long", "Lat", "NoInSeries"
    ), class = "data.frame", row.names = c(NA, -14L))
    

    #get.counter gets the row index when the value of the column starts to decrease as opposed to ascending order.

    get.counter <- function(x){
      a1 = x
      counter = 0
      a2 = c()
      for( i in 1:length(a1)){  
        if(i < length(a1)){
          if(a1[i+1] > a1[i]){
            counter = counter + 1
          }else{
            counter = counter + 1
            a2 = c(a2, counter)
            counter = 0
          }
        }else{
          counter = counter + 1
          a2 = c(a2, counter)
        }
      }
      return(a2)
    }
    

    # avg.seg.col function outputs a data frame with the segmented average of a column. df1 is the input data frame, colvar is the column name (Eg: Long or Lat), and get_counter is the output from get.counter function.

    avg.seg.col <- function(df1, colvar, get_counter){ 
    
      long <- c()
    
      start = 1
    
      for(i in cumsum(get_counter)){
        end = i
        b1 = subset(df1, select = colvar)[start:end,]
    
        mean_b1 = mean(b1)
    
        long = c(long, mean_b1, rep(NA, (length(b1)-1)))
    
        start = end+1
      }
      return(data.frame(long, stringsAsFactors = FALSE))
    }
    

    # read in data from a text file using read.table function. You need to make sure your file exists in current working directory. Working directory can be set by setwd("path of current working directory")

    df1 <- read.table(file = "file1.txt", 
                      header = TRUE, 
                      sep = "\t", 
                      stringsAsFactors = FALSE)
    

    # apply get.counter function with a vector from df1$NoInSeries

    get_counter <- get.counter(df1$NoInSeries)
    

    # Apply avg.seg.col function for Long column

    AvgLong <- avg.seg.col(df1, "Long", get_counter)
    

    # Apply avg.seg.col function for Lat column

    AvgLat <- avg.seg.col(df1, "Lat", get_counter)
    

    # Merge data frames by column

    df2 <- do.call("cbind", list(df1, AvgLong, AvgLat))
    

    # Assign column names

    colnames(df2) <- c(colnames(df2)[1:(ncol(df2)-2)], "AvgLong", "AvgLat")
    

    Output:

         print(df2)
       ID       Date     Time     Long       Lat NoInSeries  AvgLong    AvgLat
    1  12 17/11/2014 22:09:17 115.9508 -31.82850         11 115.9510 -31.82856
    2  13 17/11/2014 22:09:18 115.9508 -31.82846         12       NA        NA
    3  14 17/11/2014 22:09:19 115.9513 -31.82864         13       NA        NA
    4  15 17/11/2014 22:09:21 115.9511 -31.82863         14       NA        NA
    5  26 18/11/2014 00:07:14 115.9509 -31.82829         11 115.9509 -31.82830
    6  27 18/11/2014 00:07:15 115.9509 -31.82829         12       NA        NA
    7  28 18/11/2014 00:07:16 115.9509 -31.82830         13       NA        NA
    8  29 18/11/2014 00:07:17 115.9509 -31.82830         14       NA        NA
    9  30 18/11/2014 00:07:18 115.9509 -31.82831         15       NA        NA
    10 56 18/11/2014 10:00:24 115.9513 -31.82670         11 115.9514 -31.82669
    11 57 18/11/2014 10:00:25 115.9514 -31.82670         12       NA        NA
    12 58 18/11/2014 10:00:26 115.9514 -31.82669         13       NA        NA
    13 59 18/11/2014 10:00:27 115.9514 -31.82668         14       NA        NA
    14 60 18/11/2014 10:00:28 115.9514 -31.82668         15       NA        NA
    

    #after removing rows with NA, the output looks like below

    df2[-(which(df2$AvgLong %in% NA)), ]
       ID       Date     Time     Long       Lat NoInSeries  AvgLong    AvgLat
    1  12 17/11/2014 22:09:17 115.9508 -31.82850         11 115.9510 -31.82856
    5  26 18/11/2014 00:07:14 115.9509 -31.82829         11 115.9509 -31.82830
    10 56 18/11/2014 10:00:24 115.9513 -31.82670         11 115.9514 -31.82669