I have a large dataset with observations every 1/2 second. It represents a series of milkings in a robot. So a cow enters the robot and gets milked while different things are measured. Then the cow leaves and for a while there is no cow until another cow enters. I wrote code for an oversimplified sample of my data.
Sample Data
set.seed(66)
ID <- rep(c(84,-999,88,-999),c(5,3,7,5))
TimeStamp <- rep(c("09:31:12",NA,"09:45:31",NA),c(5,3,7,5))
VAR1 <- sample(x = 800:3000, 20)
mydf <- data.frame(ID=ID, TimeStamp=TimeStamp, VAR1 = VAR1)
eventIDs <- as.numeric(factor(mydf[, "TimeStamp"], exclude = NULL))
mydf[, "eventIDs"] <- eventIDs
mydf
# ID TimeStamp VAR1 eventIDs
#1 84 09:31:12 2978 1
#2 84 09:31:12 2625 1
#3 84 09:31:12 2088 1
#4 84 09:31:12 1716 1
#5 84 09:31:12 2254 1
#6 -999 <NA> 1632 3
#7 -999 <NA> 1731 3
#8 -999 <NA> 2648 3
#9 88 09:45:31 1625 2
#10 88 09:45:31 2798 2
#11 88 09:45:31 1366 2
#12 88 09:45:31 1245 2
#13 88 09:45:31 1291 2
#14 88 09:45:31 2801 2
#15 88 09:45:31 2746 2
#16 -999 <NA> 1411 3
#17 -999 <NA> 1738 3
#18 -999 <NA> 1398 3
#19 -999 <NA> 1918 3
#20 -999 <NA> 1575 3
The ID
variable in the data represents the cow/no-cow event. The -999
was decided on when ID == NA
. The TimeStamp
represents the start of milking. This variable is used to identify the eventIDs
variable. This is important because over the course of the day cows can be milked multiple times. That's why it was created in order to separate out each milking event for each cow. VAR1
represents whatever variable that is being investigated.
Desired Output
I need the mean of VAR1
for each milking and each non-milking. Currently, this is the output I have:
(res1 <- aggregate(mydf[,"VAR1"],
by = list(ID = mydf[,"ID"], eventIDs = mydf[,"eventIDs"]),
FUN = mean))
# ID eventIDs x
#1 84 1 2332.200
#2 88 2 1981.714
#3 -999 3 1756.375
This is obvious results because there is no indicator variable that separates out each successive 'non-milking' event. In fact I would like this:
meanVAR1 <- c((2978+2625+2088+1716+2254)/5,
(1632+1731+2648)/3,
(1625+2798+1366+1245+1291+2801+2746)/7,
(1411+1738+1398+1918+1575)/5)
eventIDs <- c(1,3,2,3)
(res2 <- data.frame(ID = ID1, meanVAR1 = meanVAR1, eventIDs = eventIDs))
# ID meanVAR1 eventIDs
#1 84 2332.200 1
#2 -999 2003.667 3
#3 88 1981.714 2
#4 -999 1608.000 3
Any suggestions?
Assuming that your data is sorted as in your example chronologically. Using rle
it is quite easy to create a proxy ID that is unique for each event:
mydf$ID2 = rep(1:length(rle(mydf$ID)$lengths),times=rle(mydf$ID)$lengths)
Then in conjunction with aggregate
or as in my example data.table
this new ID will help us get the desired result:
setDT(mydf)[,list(meanVar1=mean(VAR1)),by=c("ID","ID2","eventIDs")]
# ID ID2 eventIDs meanVar1
# 1: 84 1 1 1 2332.200
# 2: -999 2 2 3 2003.667
# 3: 88 3 3 2 1981.714
# 4: -999 4 4 3 1608.000