I have a time series dataset from a weather station. It has 3 columns: time (date and time), p (rain, mm) and h (water level, m).
I need to make a new column factor_rain with values 1 and 0. The value is 1 if the water level (df$h) was influenced by rain (df$p), which is the case if there was rain within the last 5 hours (5 entries). In all other cases it should be 0.
A part of the dataset is here:
df <- data.frame(time = c("2017-06-04 9:00:00", "2017-06-04 13:00:00", "2017-06-04 17:00:00",
"2017-06-04 19:00:00", "2017-06-04 21:00:00", "2017-06-04 23:00:00",
"2017-06-05 9:00:00", "2017-06-05 11:00:00",
"2017-06-05 13:00:00", "2017-06-05 16:00:00",
"2017-06-05 19:00:00", "2017-06-05 21:00:00", "2017-06-05 23:00:00",
"2017-06-06 9:00:00", "2017-06-06 11:00:00", "2017-06-06 13:00:00",
"2017-06-06 16:00:00", "2017-06-06 17:00:00", "2017-06-06 18:00:00",
"2017-06-06 19:00:00"),
p = c(NA, NA, 16.4, NA, NA, NA, NA, NA, NA, NA, 12,
NA, NA, NA, NA, NA, NA, NA, NA, NA),
h = c(23,NA,NA,NA,NA,32,NA,NA,28,NA,NA,
33,NA,NA,NA,29,NA,NA,NA,NA))
I tried the simplest way I could think of, but unfortunately it works for only one case:
> df$factor_rain[df$p[-c(1:5)] > 1 & df$h > 1] <- 1
Warning message:
In df$p[-c(1:5)] > 1 & df$h > 1 :
  longer object length is not a multiple of shorter object length
Is there any way to fix it? It would be great if you could suggest how to use real time (something from the xts library, for example). I mean using a 5-hour threshold, not 5 values.
By the way, I need to get this as a result:
> df
time p h factor_rain
1 2017-06-04 9:00:00 NA 23 0
2 2017-06-04 13:00:00 NA NA 0
3 2017-06-04 17:00:00 16.4 NA 0
4 2017-06-04 19:00:00 NA NA 0
5 2017-06-04 21:00:00 NA NA 0
6 2017-06-04 23:00:00 NA 32 1
7 2017-06-05 9:00:00 NA NA 0
8 2017-06-05 11:00:00 NA NA 0
9 2017-06-05 13:00:00 NA 28 0
10 2017-06-05 16:00:00 NA NA 0
11 2017-06-05 19:00:00 12.0 NA 0
12 2017-06-05 21:00:00 NA 33 1
13 2017-06-05 23:00:00 NA NA 0
14 2017-06-06 9:00:00 NA NA 0
15 2017-06-06 11:00:00 NA NA 0
16 2017-06-06 13:00:00 NA 29 0
17 2017-06-06 16:00:00 NA NA 0
18 2017-06-06 17:00:00 NA NA 0
19 2017-06-06 18:00:00 NA NA 0
20 2017-06-06 19:00:00 NA NA 0
You can use the following, which flags every rain entry together with the four entries after it:
df$factorrain = FALSE
df$factorrain[rowSums(expand.grid(which(!is.na(df$p)), 0:4))] = TRUE
# time p h factorrain
# 1 2017-06-04 9:00:00 NA 23 FALSE
# 2 2017-06-04 13:00:00 NA NA FALSE
# 3 2017-06-04 17:00:00 16.4 NA TRUE
# 4 2017-06-04 19:00:00 NA NA TRUE
# 5 2017-06-04 21:00:00 NA NA TRUE
# 6 2017-06-04 23:00:00 NA 32 TRUE
# 7 2017-06-05 9:00:00 NA NA TRUE
# 8 2017-06-05 11:00:00 NA NA FALSE
# 9 2017-06-05 13:00:00 NA 28 FALSE
# 10 2017-06-05 16:00:00 NA NA FALSE
# 11 2017-06-05 19:00:00 12.0 NA TRUE
# 12 2017-06-05 21:00:00 NA 33 TRUE
# 13 2017-06-05 23:00:00 NA NA TRUE
# 14 2017-06-06 9:00:00 NA NA TRUE
# 15 2017-06-06 11:00:00 NA NA TRUE
# 16 2017-06-06 13:00:00 NA 29 FALSE
# 17 2017-06-06 16:00:00 NA NA FALSE
# 18 2017-06-06 17:00:00 NA NA FALSE
# 19 2017-06-06 18:00:00 NA NA FALSE
# 20 2017-06-06 19:00:00 NA NA FALSE
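The flags above mark every row inside the five-entry window, whereas the result requested in the question is a 0/1 column that is 1 only where a water level was actually measured. A minimal sketch of that extra step follows, assuming "rain" means any non-missing p and that the window is the rain entry plus the next four entries. The names window, rain_rows, idx and in_window are illustrative, not part of the original answer. The sketch also drops indices that would otherwise point past the last row when rain is recorded near the end of the data.

```r
# Sample data from the question (time omitted: this rule counts entries, not hours)
df <- data.frame(
  p = c(NA, NA, 16.4, rep(NA, 7), 12, rep(NA, 9)),
  h = c(23, NA, NA, NA, NA, 32, NA, NA, 28, NA, NA, 33, rep(NA, 3), 29, rep(NA, 4))
)

window <- 5                                # assumption: rain entry plus the next 4 entries
rain_rows <- which(!is.na(df$p))           # rows that hold a rain record

# All row indices covered by some rain window
idx <- unique(as.vector(outer(rain_rows, seq_len(window) - 1, "+")))
idx <- idx[idx <= nrow(df)]                # keep only indices that exist in df

in_window <- seq_len(nrow(df)) %in% idx

# 1 only where a water level was measured inside a rain window, 0 otherwise
df$factor_rain <- as.integer(in_window & !is.na(df$h))

which(df$factor_rain == 1)
# 6 12
```

On the sample data this gives 1 for rows 6 and 12 and 0 elsewhere, which agrees with the table in the question.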
Or, a similar approach with sapply:
df$factorrain = FALSE
df$factorrain[sapply(which(!is.na(df$p)), function(x) x+(0:4))] = TRUE
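Both snippets count entries rather than hours. For the real-time threshold mentioned in the question, one possible sketch in base R (not xts) parses the timestamps and compares each row with the most recent rain record. The "UTC" time zone, the inclusive 5-hour boundary, and the names rain_times, threshold and hours_since_rain are assumptions made for illustration.

```r
# Sample data from the question
df <- data.frame(
  time = c("2017-06-04 9:00:00", "2017-06-04 13:00:00", "2017-06-04 17:00:00",
           "2017-06-04 19:00:00", "2017-06-04 21:00:00", "2017-06-04 23:00:00",
           "2017-06-05 9:00:00", "2017-06-05 11:00:00", "2017-06-05 13:00:00",
           "2017-06-05 16:00:00", "2017-06-05 19:00:00", "2017-06-05 21:00:00",
           "2017-06-05 23:00:00", "2017-06-06 9:00:00", "2017-06-06 11:00:00",
           "2017-06-06 13:00:00", "2017-06-06 16:00:00", "2017-06-06 17:00:00",
           "2017-06-06 18:00:00", "2017-06-06 19:00:00"),
  p = c(NA, NA, 16.4, rep(NA, 7), 12, rep(NA, 9)),
  h = c(23, NA, NA, NA, NA, 32, NA, NA, 28, NA, NA, 33, rep(NA, 3), 29, rep(NA, 4))
)

# Parse the timestamps; the time zone is an assumption, the question does not state one
df$time <- as.POSIXct(df$time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

rain_times <- df$time[!is.na(df$p)]   # timestamps of rain records
threshold <- 5                        # hours, boundary treated as inclusive (assumption)

# Hours since the most recent rain record at or before each row (Inf if there is none)
hours_since_rain <- sapply(seq_len(nrow(df)), function(i) {
  lag <- as.numeric(difftime(df$time[i], rain_times, units = "hours"))
  lag <- lag[lag >= 0]
  if (length(lag) == 0) Inf else min(lag)
})

# 1 only where a water level was measured within the threshold after rain
df$factor_rain <- as.integer(!is.na(df$h) & hours_since_rain <= threshold)
```

Note that with a strict 5-hour threshold only row 12 is flagged on the sample data: row 6 (23:00) comes 6 hours after the rain recorded at 17:00, so it matches the expected table only under the 5-entry rule.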