Search code examples
rdataframedata-analysis

Way to separate out individual events


I have a large dataset with observations every 1/2 second. It represents a series of milkings in a robot. So a cow enters the robot and gets milked while different things are measured. Then the cow leaves and for a while there is no cow until another cow enters. I wrote code for an oversimplified sample of my data.

Sample Data

set.seed(66)
ID <- rep(c(84,-999,88,-999),c(5,3,7,5))
TimeStamp <- rep(c("09:31:12",NA,"09:45:31",NA),c(5,3,7,5))
VAR1 <- sample(x = 800:3000, 20)
mydf <- data.frame(ID=ID, TimeStamp=TimeStamp, VAR1 = VAR1)
eventIDs <- as.numeric(factor(mydf[, "TimeStamp"], exclude = NULL))
mydf[, "eventIDs"] <- eventIDs
mydf
#     ID TimeStamp VAR1 eventIDs
#1    84  09:31:12 2978        1
#2    84  09:31:12 2625        1
#3    84  09:31:12 2088        1
#4    84  09:31:12 1716        1
#5    84  09:31:12 2254        1
#6  -999      <NA> 1632        3
#7  -999      <NA> 1731        3
#8  -999      <NA> 2648        3
#9    88  09:45:31 1625        2
#10   88  09:45:31 2798        2
#11   88  09:45:31 1366        2
#12   88  09:45:31 1245        2
#13   88  09:45:31 1291        2
#14   88  09:45:31 2801        2
#15   88  09:45:31 2746        2
#16 -999      <NA> 1411        3
#17 -999      <NA> 1738        3
#18 -999      <NA> 1398        3
#19 -999      <NA> 1918        3
#20 -999      <NA> 1575        3

The ID variable in the data represents the cow/no-cow event. The -999 was decided on when ID == NA. The TimeStamp represents the start of milking. This variable is used to identify the eventIDs variable. This is important because over the course of the day cows can be milked multiple times. That's why it was created in order to separate out each milking event for each cow. VAR1 represents whatever variable that is being investigated.

Desired Output

I need the mean of VAR1 for each milking and each non-milking. Currently, this is the output I have:

(res1 <- aggregate(mydf[,"VAR1"],
                   by = list(ID = mydf[,"ID"], eventIDs = mydf[,"eventIDs"]),
                   FUN = mean))
#    ID eventIDs        x
#1   84        1 2332.200
#2   88        2 1981.714
#3 -999        3 1756.375

This is obvious results because there is no indicator variable that separates out each successive 'non-milking' event. In fact I would like this:

meanVAR1 <- c((2978+2625+2088+1716+2254)/5,
              (1632+1731+2648)/3,
              (1625+2798+1366+1245+1291+2801+2746)/7,
              (1411+1738+1398+1918+1575)/5) 
eventIDs <- c(1,3,2,3) 
(res2 <- data.frame(ID = ID1, meanVAR1 = meanVAR1, eventIDs = eventIDs))
#    ID meanVAR1 eventIDs
#1   84 2332.200        1
#2 -999 2003.667        3
#3   88 1981.714        2
#4 -999 1608.000        3

Any suggestions?


Solution

  • Assuming that your data is sorted as in your example chronologically. Using rleit is quite easy to create a proxy ID that is unique for each event:

    mydf$ID2 = rep(1:length(rle(mydf$ID)$lengths),times=rle(mydf$ID)$lengths)
    

    Then in conjunction with aggregate or as in my example data.table this new ID will help us get the desired result:

    setDT(mydf)[,list(meanVar1=mean(VAR1)),by=c("ID","ID2","eventIDs")]
       #       ID ID2 eventIDs meanVar1
       # 1:   84 1   1        1 2332.200
       # 2: -999 2   2        3 2003.667
       # 3:   88 3   3        2 1981.714
       # 4: -999 4   4        3 1608.000