Please note: this is a hyper-simplified explanation of where the 'data' comes from, but where the data comes from is irrelevant to the coding question.
I have a data set created by collecting water in a tube every day. I can't go and measure the tube every day (but the tube keeps filling), so there are gaps in the water value records. This dummy data set shows where this has happened on day 5 and on days 7 to 9. Because this is a dummy data set, I have assumed that 500ml of water goes into the tube each day (the real data set is a lot messier!).
day<-c(1,2,3,4,5,6,7,8,9,10,11,12)
value<-c(500,500,500,500,NA,1000,NA,NA,NA,2000,500,500)
df<-data.frame(day,value)
Data explanation: I collected every day for days 1:4, so the value for each of those days is 500ml. I missed day 5, so its value is NA, and collected on day 6, so that value is 1000ml (the water from days 5 and 6 combined). I missed days 7, 8 and 9, so their values are NA, and collected on day 10, giving a value of 2000ml for the 4 days. I then collected every day for the last two days.
I would like to fill in the NA gaps by taking the value of the next 'real' measurement and dividing it equally between the NA days and the day of that measurement. Yes, I am assuming that when I have not made a measurement the process is constant, so I can divide the next measurement equally between the days. The desired result is:
day<-c(1,2,3,4,5,6,7,8,9,10,11,12)
corrected.value<-c(500,500,500,500,500,500,500,500,500,500,500,500)
corrected.df<-data.frame(day,corrected.value)
Again, this is just a dummy data set; otherwise the easiest way would be to replace NA with 500 using 'value[is.na(value)] <- 500', but in the real data set the values can be 457.6, 779, 376, etc.
I also tried to write a loop but keep getting stuck...
Any ideas on how I can do this?
Help is greatly appreciated
Here's a possible solution :
# Create test Data:
# note that this is slightly different from your input
# but in this way you can better verify that it works as expected
day<-c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
value<-c(NA,500,500,500,NA,3000,NA,NA,NA,5000,500,500,NA,NA,NA)
df<-data.frame(day,value)
# "Cleansing" starts here :
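# run-length encode the NA mask: consecutive runs of NA / non-NA values
RLE <- rle(is.na(df$value))
# we cannot do anything if last values are NAs, we'll just keep them in the data.frame
if(tail(RLE$values,1)){
  RLE$lengths <- head(RLE$lengths,-1)
  RLE$values <- head(RLE$values,-1)
}
# index of the first measured value after each NA run
afterNA <- cumsum(RLE$lengths)[RLE$values] + 1
# index of the first NA in each NA run
firstNA <- (cumsum(RLE$lengths)- RLE$lengths + 1)[RLE$values]
# number of days sharing each measurement (the NA run plus the measurement day)
occurences <- afterNA - firstNA + 1
# each measurement divided equally between those days
replacements <- df$value[afterNA] / occurences
# overwrite every NA run and its following measurement with the per-day share
df$value[unlist(Map(f=seq.int,firstNA,afterNA))] <- rep.int(replacements,occurences)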
Result :
> df
   day value
1    1   250
2    2   250
3    3   500
4    4   500
5    5  1500
6    6  1500
7    7  1250
8    8  1250
9    9  1250
10  10  1250
11  11   500
12  12   500
13  13    NA
14  14    NA
15  15    NA
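If you prefer something more compact, the same idea can probably be written with grouping instead of index arithmetic. This is only a sketch under the same assumption (each measurement is split equally over its own day and the NA days right before it); the function name spread_backward is made up for illustration, and only base R (cumsum, rev, ave) is used:

```r
# Hypothetical helper: spread each measurement evenly over its own day
# and the run of NA days immediately before it.
spread_backward <- function(value) {
  measured <- !is.na(value)
  # Counting measurements from the end gives every NA the same group id
  # as the next measurement after it; trailing NAs end up in group 0.
  grp <- rev(cumsum(rev(measured)))
  # Within each group, divide the single measured value by the group size.
  # A group with no measurement (trailing NAs) is left untouched.
  ave(value, grp, FUN = function(v) {
    if (all(is.na(v))) v else sum(v, na.rm = TRUE) / length(v)
  })
}

# Data from the question
day <- c(1,2,3,4,5,6,7,8,9,10,11,12)
value <- c(500,500,500,500,NA,1000,NA,NA,NA,2000,500,500)
df <- data.frame(day, value)
df$corrected.value <- spread_backward(df$value)
```

On the question's data this should give 500 for every day, and on the test data above it should reproduce the same result as the rle version, including the trailing NAs that are left as they are.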