
Grouping rows of large dataset in R


I am trying to calculate driver activity using GPS data. I've written a loop that calculates the difference in time between two consecutive points in a dataframe over the range of values, summing it as it goes.

Here is an example of my data:

  DriveNo       Date.and.Time Latitude Longitude
1     156 2014-01-31 23:00:00 41.88367  12.48778
2     187 2014-01-31 23:00:01 41.92854  12.46904
3     297 2014-01-31 23:00:01 41.89107  12.49270
4      89 2014-01-31 23:00:01 41.79318  12.43212
5      79 2014-01-31 23:00:01 41.90028  12.46275
6     191 2014-01-31 23:00:02 41.85231  12.57741

Reprex:

taxi_noOutlier <- structure(list(DriveNo = c(156, 187, 297, 89, 79, 191),
  Date.and.Time = structure(c(1391209200.73917, 1391209201.14846,
    1391209201.22007, 1391209201.47085, 1391209201.63114, 1391209202.04855),
    class = c("POSIXct", "POSIXt"), tzone = "UTC"),
  Latitude = c(41.883670807, 41.928543091, 41.891067505, 41.793178558,
    41.900276184, 41.852306366),
  Longitude = c(12.48777771, 12.469037056, 12.492704391, 12.432122231,
    12.46274662, 12.577406883)),
  row.names = c(NA, 6L), class = "data.frame")

And the loop:

library(dplyr)  # provides filter()

taxi_156 <- filter(taxi_noOutlier, DriveNo == 156)
datelist = taxi_156$Date.and.Time
dlstandard = as.POSIXlt(datelist)
diffsum <- as.numeric(sum(Filter(function(x) x <= 60, difftime(tail(dlstandard, -1), head(dlstandard, -1), units = 'secs'))))
print(paste("The total activity time for driver #156 is ", diffsum))

Which gives an output of:

[1] "The total activity time for driver #264 is  705655.37272048"

My question is, how can I expand this code to find the activity for every other driver? (There are 374 unique drivers, each with thousands of points.) I have tried to replicate the above code using a loop that would calculate the time difference for each DriveNo, but I am new to R and my understanding of loop syntax isn't great.

Can I filter into separate dataframes using a method like this? (It currently fails with an error about unexpected brackets.)

for (i in seq_along(taxi_noOutlier$DriveNo))
{
taxi_[[i]] <- filter(taxi_noOutlier, DriveNo == [[i]])
}

and then use my original code on each one? Or is there a more efficient way? Thanks


Solution

  • You can group_by each DriveNo, take the difference between consecutive Date.and.Time values, drop the differences that are longer than a minute, and sum what remains.

    library(dplyr)
    
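    result <- taxi_noOutlier %>%
      group_by(DriveNo) %>%
      # gap in seconds to the same driver's previous point (NA for the first point)
      mutate(gap = difftime(Date.and.Time, lag(Date.and.Time), units = 'secs')) %>%
      # keep gaps of at most 60 seconds; filter() also drops the NA rows
      filter(gap <= 60) %>%
      summarise(diffsum = sum(as.numeric(gap), na.rm = TRUE))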
    
    result
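
If you would rather stay close to the original per-driver logic without dplyr, here is a minimal base R sketch of the same idea. The toy data frame and the helper name activity_secs are made up for illustration (only the column names come from the question), and the sketch sorts each driver's timestamps first, which the pipeline above assumes has already been done.

```r
# Toy data with hypothetical values; column names match the question
toy <- data.frame(
  DriveNo = c(1, 1, 1, 1, 2, 2),
  Date.and.Time = as.POSIXct("2014-01-31 23:00:00", tz = "UTC") +
    c(0, 10, 40, 400, 5, 35)
)

# Hypothetical helper: sum the gaps between consecutive points,
# ignoring any gap longer than max_gap seconds
activity_secs <- function(times, max_gap = 60) {
  gaps <- as.numeric(diff(sort(times)), units = "secs")
  sum(gaps[gaps <= max_gap])
}

# split() yields one vector of timestamps per driver;
# vapply() applies the helper to each and returns a named numeric vector
by_driver <- split(toy$Date.and.Time, toy$DriveNo)
activity <- vapply(by_driver, activity_secs, numeric(1))

# Driver 1: 10 + 30 = 40 (the 360 s gap is dropped); driver 2: 30
activity
```

This avoids creating a separate data frame per driver: split() does the grouping in one step, so there is no need for an explicit for loop over DriveNo.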