Tags: r, plot, dataset, visualization, temporal

Plotting instantaneous point in time summation from temporal data set


I have a dataset of SQL queries, containing start and finish timestamps as well as the specific query that was run. I would like to use this information to visualize the number of concurrent queries executing at any given time.

That is to say, with a dataset like this:

StartTime,StopTime,SQLStatement
10/11/2012 08:17:18.060,10/11/2012 08:17:19.221,DELETE FROM foo WHERE bar = ?;
10/11/2012 08:17:19.036,10/11/2012 08:17:19.911,SELECT * FROM users WHERE user_id = ?;
10/11/2012 08:17:19.661,10/11/2012 08:17:19.810,SELECT * FROM users WHERE user_id = ?;
...

I would like to produce a plot where the horizontal axis represents time and the vertical axis represents the instantaneous number of queries running. In the above example, there is one query running at 08:17:19.000 and two queries running at 08:17:19.100.

My naive approach is to iterate over the ~1 million discrete time values, counting the rows that have StartTime and StopTime spanning each value. This is slow and won't scale particularly well--and it seems like R would have a better way to do it!

A 15-minute sample of this data contains ~170,000 rows with millisecond accuracy--methods that could reasonably scale to a day's worth of data would be appreciated, but any solution is welcome!
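
For concreteness, here is a minimal sketch (my own, assuming the month/day/year timestamp layout shown in the sample above) that parses the three example rows and confirms the count of two concurrent queries at 08:17:19.100:

    # Recreate the three sample rows with parsed timestamps
    sample.log <- data.frame(
      StartTime = as.POSIXct(c("10/11/2012 08:17:18.060",
                               "10/11/2012 08:17:19.036",
                               "10/11/2012 08:17:19.661"),
                             format="%m/%d/%Y %H:%M:%OS"),
      StopTime  = as.POSIXct(c("10/11/2012 08:17:19.221",
                               "10/11/2012 08:17:19.911",
                               "10/11/2012 08:17:19.810"),
                             format="%m/%d/%Y %H:%M:%OS"))

    # Queries whose interval spans the instant 08:17:19.100
    instant <- as.POSIXct("10/11/2012 08:17:19.100", format="%m/%d/%Y %H:%M:%OS")
    sum(sample.log$StartTime <= instant & sample.log$StopTime >= instant)   # 2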


Solution

  • My simplistic approach

    # Millisecond precision for deltas
    options(digits.secs=3)
    
    # Load file
    log.file <- read.csv("data/raw.csv", header=TRUE)
    
    # Parse timestamps; the format must match the file
    # (the sample above is month/day/year)
    log.file$StartTime <- as.POSIXct(log.file$StartTime,
                                     format="%m/%d/%Y %H:%M:%OS")
    log.file$StopTime <- as.POSIXct(log.file$StopTime,
                                    format="%m/%d/%Y %H:%M:%OS")
    
    # One-second grid spanning the whole log
    time.grid <- seq(min(log.file$StartTime), max(log.file$StopTime), by=1)
    
    # Number of queries whose interval spans a given instant
    getsum <- function(tstamp) {
      nrow(subset(log.file, StartTime <= tstamp & StopTime >= tstamp))
    }
    
    dset <- data.frame(TIME=time.grid,
                       COUNT=sapply(time.grid, getsum))
    
    plot(dset$TIME, dset$COUNT)
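
    As a side note (my own sketch, not part of the original approach): the per-timestamp subset() scan can be replaced with two findInterval() lookups on sorted start/stop vectors, since the count at time t is simply (number of StartTime <= t) minus (number of StopTime < t). This avoids rescanning all ~170,000 rows for every grid point.

    # Vectorized count over the same one-second grid (sketch)
    starts <- sort(log.file$StartTime)
    stops  <- sort(log.file$StopTime)
    # findInterval(x, v) counts elements of v <= x; left.open=TRUE counts v < x
    dset$COUNT <- findInterval(dset$TIME, starts) -
                  findInterval(dset$TIME, stops, left.open=TRUE)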
    

    A better solution

    While this doesn't exactly meet the criteria I laid out earlier (i.e. it doesn't fill in the gaps between event times), it does give me a graph close to what I'm looking for. However, if there is a solution that hits the two points I asked for earlier, I'm still interested; a sketch for filling the gaps follows the code below.

    # Millisecond precision for deltas
    options(digits.secs=3)
    
    # Load file
    log.file <- read.csv("data/raw.csv", header=TRUE)
    
    # Parse timestamps; the format must match the file
    # (the sample above is month/day/year)
    log.file$StartTime <- as.POSIXct(log.file$StartTime,
                                     format="%m/%d/%Y %H:%M:%OS")
    log.file$StopTime <- as.POSIXct(log.file$StopTime,
                                    format="%m/%d/%Y %H:%M:%OS")
    
    # Create data frames for query start times and query stop times
    # and give them +1 and -1 values
    queries.start <- data.frame(Time=log.file$StartTime, Value=1)
    queries.stop <- data.frame(Time=log.file$StopTime, Value=-1)
    
    # Merge the two together and order by time
    queries.both <- rbind(queries.start, queries.stop)
    queries.both <- queries.both[order(queries.both$Time), ]
    
    # Create a cumulative sum data frame, and plot the results
    queries.sum <- data.frame(Time=queries.both$Time,
                              Queries=cumsum(queries.both$Value))
    plot(queries.sum, type="l")
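
    To also fill in the time gaps mentioned above, one option (my own sketch, building on the queries.both object from the code above) is to evaluate the running total on a regular one-second grid with findInterval() and draw it as a step plot, so idle periods show up as flat segments instead of being skipped:

    # Running total of concurrent queries after each event, in event order
    running <- cumsum(queries.both$Value)
    
    # Regular one-second grid spanning the whole log
    grid <- seq(min(queries.both$Time), max(queries.both$Time), by=1)
    
    # For each grid time, take the running total as of the most recent event;
    # before the first event the count is 0
    counts <- c(0, running)[findInterval(grid, queries.both$Time) + 1]
    
    plot(grid, counts, type="s",
         xlab="Time", ylab="Concurrent queries")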