R: yes-no factor based on previous entries


I have a time series dataset from a weather station. There are 3 columns: time (date and time); p (rain, mm); h (water level, m).

I need to make a new column, factor_rain, with values 1 and 0: 1 if the water level (df$h) was influenced by rain (df$p), meaning there was rain within the last 5 hours (5 entries), and 0 in all other cases.

Part of the dataset is here:

df <- data.frame(time = c("2017-06-04 9:00:00", "2017-06-04 13:00:00",  "2017-06-04 17:00:00",
                            "2017-06-04 19:00:00",  "2017-06-04 21:00:00",  "2017-06-04 23:00:00",
                            "2017-06-05 9:00:00",   "2017-06-05 11:00:00",
                            "2017-06-05 13:00:00",  "2017-06-05 16:00:00",
                            "2017-06-05 19:00:00",  "2017-06-05 21:00:00",  "2017-06-05 23:00:00",
                            "2017-06-06 9:00:00",   "2017-06-06 11:00:00",  "2017-06-06 13:00:00",
                            "2017-06-06 16:00:00",  "2017-06-06 17:00:00",  "2017-06-06 18:00:00",
                            "2017-06-06 19:00:00"),
                   p = c(NA, NA, 16.4, NA, NA, NA, NA, NA, NA, NA, 12, 
                         NA, NA, NA, NA, NA, NA, NA, NA, NA),
                   h = c(23,NA,NA,NA,NA,32,NA,NA,28,NA,NA,
                        33,NA,NA,NA,29,NA,NA,NA,NA))

I tried the simplest way I could think of, but unfortunately it works for only one case:

> df$factor_rain[df$p[-c(1:5)] > 1 & df$h > 1] <- 1
Warning message:
In df$p[-c(1:5)] > 1 & df$h > 1 :
  longer object length is not a multiple of shorter object length

Is there any way to fix it? If you can suggest how to use real time (something from the xts library, for example), that would be great. I mean using a 5-hour threshold, not 5 values.

By the way, I need to get this as a result:

> df
                  time    p  h factor_rain
1   2017-06-04 9:00:00   NA 23           0
2  2017-06-04 13:00:00   NA NA           0
3  2017-06-04 17:00:00 16.4 NA           0
4  2017-06-04 19:00:00   NA NA           0
5  2017-06-04 21:00:00   NA NA           0
6  2017-06-04 23:00:00   NA 32           1
7   2017-06-05 9:00:00   NA NA           0
8  2017-06-05 11:00:00   NA NA           0
9  2017-06-05 13:00:00   NA 28           0
10 2017-06-05 16:00:00   NA NA           0
11 2017-06-05 19:00:00 12.0 NA           0
12 2017-06-05 21:00:00   NA 33           1
13 2017-06-05 23:00:00   NA NA           0
14  2017-06-06 9:00:00   NA NA           0
15 2017-06-06 11:00:00   NA NA           0
16 2017-06-06 13:00:00   NA 29           0
17 2017-06-06 16:00:00   NA NA           0
18 2017-06-06 17:00:00   NA NA           0
19 2017-06-06 18:00:00   NA NA           0
20 2017-06-06 19:00:00   NA NA           0

Solution

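  • You can flag each row that has a rain reading, plus the 4 rows that follow it:

    # which() gives the rain rows; expand.grid + rowSums adds offsets 0..4 to each
    df$factorrain = FALSE
    df$factorrain[rowSums(expand.grid(which(!is.na(df$p)), 0:4))] = TRUE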
    
    #                   time    p  h factorrain
    # 1   2017-06-04 9:00:00   NA 23   FALSE
    # 2  2017-06-04 13:00:00   NA NA   FALSE
    # 3  2017-06-04 17:00:00 16.4 NA    TRUE
    # 4  2017-06-04 19:00:00   NA NA    TRUE
    # 5  2017-06-04 21:00:00   NA NA    TRUE
    # 6  2017-06-04 23:00:00   NA 32    TRUE
    # 7   2017-06-05 9:00:00   NA NA    TRUE
    # 8  2017-06-05 11:00:00   NA NA   FALSE
    # 9  2017-06-05 13:00:00   NA 28   FALSE
    # 10 2017-06-05 16:00:00   NA NA   FALSE
    # 11 2017-06-05 19:00:00 12.0 NA    TRUE
    # 12 2017-06-05 21:00:00   NA 33    TRUE
    # 13 2017-06-05 23:00:00   NA NA    TRUE
    # 14  2017-06-06 9:00:00   NA NA    TRUE
    # 15 2017-06-06 11:00:00   NA NA    TRUE
    # 16 2017-06-06 13:00:00   NA 29   FALSE
    # 17 2017-06-06 16:00:00   NA NA   FALSE
    # 18 2017-06-06 17:00:00   NA NA   FALSE
    # 19 2017-06-06 18:00:00   NA NA   FALSE
    # 20 2017-06-06 19:00:00   NA NA   FALSE
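
Note that the output above flags every row inside a rain window, while the table requested in the question flags only rows that also have a water-level reading (rows 6 and 12). Here is a minimal sketch of one way to get there. It assumes, as above, that any non-NA p counts as rain and that the window is the rain row plus the next 4 rows; n_window, covered and in_window are names made up for this sketch. It also drops indices past the last row, which the one-liner above would otherwise produce when rain falls within the final 4 entries.

```r
# p and h copied from the question's data (time is not needed for a row-count window)
df <- data.frame(
  p = c(NA, NA, 16.4, NA, NA, NA, NA, NA, NA, NA, 12,
        NA, NA, NA, NA, NA, NA, NA, NA, NA),
  h = c(23, NA, NA, NA, NA, 32, NA, NA, 28, NA, NA,
        33, NA, NA, NA, 29, NA, NA, NA, NA)
)

n_window <- 5  # assumption: a rain entry covers itself and the next 4 rows

rain_rows <- which(!is.na(df$p))

# Every row index covered by some rain window, clipped to the data frame
covered <- unique(as.vector(outer(rain_rows, 0:(n_window - 1), `+`)))
covered <- covered[covered <= nrow(df)]

in_window <- seq_len(nrow(df)) %in% covered

# 1 only where a water level was actually recorded inside a rain window
df$factor_rain <- as.integer(in_window & !is.na(df$h))

which(df$factor_rain == 1)
```

On the sample data this leaves factor_rain equal to 1 in rows 6 and 12 only, as in the requested table.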
    

    Or, a similar approach with sapply:

    df$factorrain = FALSE
    df$factorrain[sapply(which(!is.na(df$p)), function(x) x+(0:4))] = TRUE
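
For the real-time threshold asked about in the question, here is a rough base-R sketch (it does not use xts). It assumes the timestamps parse with as.POSIXct in UTC, that any non-NA p counts as rain, and that "influenced" means a water-level reading taken no more than window_hours after a rain reading. The names window_hours, rain_times and lag_h are introduced here for illustration.

```r
df <- data.frame(
  time = c("2017-06-04 9:00:00", "2017-06-04 13:00:00", "2017-06-04 17:00:00",
           "2017-06-04 19:00:00", "2017-06-04 21:00:00", "2017-06-04 23:00:00",
           "2017-06-05 9:00:00", "2017-06-05 11:00:00", "2017-06-05 13:00:00",
           "2017-06-05 16:00:00", "2017-06-05 19:00:00", "2017-06-05 21:00:00",
           "2017-06-05 23:00:00", "2017-06-06 9:00:00", "2017-06-06 11:00:00",
           "2017-06-06 13:00:00", "2017-06-06 16:00:00", "2017-06-06 17:00:00",
           "2017-06-06 18:00:00", "2017-06-06 19:00:00"),
  p = c(NA, NA, 16.4, NA, NA, NA, NA, NA, NA, NA, 12,
        NA, NA, NA, NA, NA, NA, NA, NA, NA),
  h = c(23, NA, NA, NA, NA, 32, NA, NA, 28, NA, NA,
        33, NA, NA, NA, 29, NA, NA, NA, NA)
)

# Parse timestamps; UTC is an assumption that sidesteps daylight-saving jumps
df$time <- as.POSIXct(df$time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

window_hours <- 5  # threshold in hours (assumed to be inclusive)
rain_times <- df$time[!is.na(df$p)]

# Hours since the most recent rain reading at or before each row (NA if none)
lag_h <- vapply(seq_len(nrow(df)), function(i) {
  lag <- as.numeric(difftime(df$time[i], rain_times, units = "hours"))
  lag <- lag[lag >= 0]
  if (length(lag) == 0) NA_real_ else min(lag)
}, numeric(1))

# 1 only where a water level was recorded within the time window after rain
df$factor_rain <- as.integer(!is.na(df$h) & !is.na(lag_h) & lag_h <= window_hours)

which(df$factor_rain == 1)
```

With a strict 5-hour window only row 12 is flagged: the reading in row 6 comes 6 hours after the rain in row 3, so it appears in the requested table only under the 5-entry rule (or with window_hours <- 6).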