r dplyr

Assigning an identifier to a sequence of events, avoiding false observations


Sample data I receive from the device's operation recorder:

df1 <- read.table(text = "temp.1
heating
heating
heating
heating
heating
heating
heating
heating
cooling
heating
heating
heating
heating
heating
heating
cooling
cooling
cooling
cooling
cooling
cooling
cooling
heating
heating
heating
cooling
cooling
heating
heating
heating
cooling
heating
heating
heating
heating
cooling
cooling
cooling
cooling
heating
heating
heating
cooling
heating
cooling
heating
cooling
heating
heating
heating
heating", header = TRUE)

Occasionally, one or two consecutive "cooling" observations occur during "heating". These are errors and I would like them to be ignored. After this correction, I would like to label the duty cycles. Each label should also contain a sequential number, because I need to know how many heating and cooling cycles occurred on a given day. Expected result:

> df1
    temp.1 level
1  heating   H.1
2  heating   H.1
3  heating   H.1
4  heating   H.1
5  heating   H.1
6  heating   H.1
7  heating   H.1
8  heating   H.1
9  cooling   H.1
10 heating   H.1
11 heating   H.1
12 heating   H.1
13 heating   H.1
14 heating   H.1
15 heating   H.1
16 cooling   C.1
17 cooling   C.1
18 cooling   C.1
19 cooling   C.1
20 cooling   C.1
21 cooling   C.1
22 cooling   C.1
23 heating   H.2
24 heating   H.2
25 heating   H.2
26 cooling   H.2
27 cooling   H.2
28 heating   H.2
29 heating   H.2
30 heating   H.2
31 cooling   H.2
32 heating   H.2
33 heating   H.2
34 heating   H.2
35 heating   H.2
36 cooling   C.2
37 cooling   C.2
38 cooling   C.2
39 cooling   C.2
40 heating   H.3
41 heating   H.3
42 heating   H.3
43 cooling   H.3
44 heating   H.3
45 cooling   H.3
46 heating   H.3
47 cooling   H.3
48 heating   H.3
49 heating   H.3
50 heating   H.3
51 heating   H.3

EDIT2: There was one more case I hadn't anticipated, and my question wasn't precise. Please look at rows 51-53. When a "cooling" series is interrupted by a single "heating", that observation should also be ignored. I tried to modify your solution, but without success.

df1
     temp.1 level
 1: heating   H.1
 2: heating   H.1
 3: heating   H.1
 4: heating   H.1
 5: heating   H.1
 6: heating   H.1
 7: heating   H.1
 8: heating   H.1
 9: cooling   H.1
10: heating   H.1
11: heating   H.1
12: heating   H.1
13: heating   H.1
14: heating   H.1
15: heating   H.1
16: cooling   C.1
17: cooling   C.1
18: cooling   C.1
19: cooling   C.1
20: cooling   C.1
21: cooling   C.1
22: cooling   C.1
23: heating   H.2
24: heating   H.2
25: heating   H.2
26: cooling   H.2
27: cooling   H.2
28: heating   H.2
29: heating   H.2
30: heating   H.2
31: cooling   H.2
32: heating   H.2
33: heating   H.2
34: heating   H.2
35: heating   H.2
36: cooling   C.2
37: cooling   C.2
38: cooling   C.2
39: cooling   C.2
40: heating   H.3
41: heating   H.3
42: heating   H.3
43: cooling   H.3
44: heating   H.3
45: cooling   H.3
46: heating   H.3
47: cooling   C.3
48: cooling   C.3
49: cooling   C.3
50: cooling   C.3
51: cooling   C.3
52: heating   C.3
53: cooling   C.3
54: cooling   C.3
55: cooling   C.3
56: heating   H.4
57: heating   H.4
58: heating   H.4

The "level" changes only when "cooling" appears at least 3 times in a row after "heating", or "heating" appears at least 3 times in a row after "cooling". Therefore, rows 26-27 (two "cooling" observations) are treated as errors, whereas rows 23-25 (three "heating" observations) do change the "level".
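To make the rule concrete, here is a small base R illustration using rle() on rows 23-35 of the table above. The column name valid_run is only a label chosen for this sketch; the threshold of 3 is the one stated in the rule.

```r
# Rows 23-35 of the sample above, written as run lengths
x <- rep(c("heating", "cooling", "heating", "cooling", "heating"),
         times = c(3, 2, 3, 1, 4))

# rle() collapses the vector into runs of identical values
runs <- rle(x)

# A run only counts as a real state if it has 3 or more observations
run_table <- data.frame(state     = runs$values,
                        length    = runs$lengths,
                        valid_run = runs$lengths >= 3)
run_table
```

The two "cooling" runs (lengths 2 and 1) are flagged as not valid, so the whole stretch stays in a single heating cycle.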


Solution

  • a data.table approach

    library(data.table)
    # set to data.table format
    setDT(df1)
    # initialise heating or cooling level
    df1[, level := toupper(substr(temp.1,1,1))]
    # override the level of runs of length 2 or less with "H"
    df1[, level := if (.N <= 2) "H", by = .(rleid(temp.1))]
    # temporary run id of the corrected level, dropped at the end
    df1[, temp := rleid(level)]
    # create the correct level id, and afterwards drop the temp column
    df1[, level := paste(level, as.integer(factor(temp)), sep = "."), by = .(level)][, temp := NULL][]
    

    Update for the revised sample data / desired output:

    library(data.table)
    setDT(df1)
    # determine groups of 3 (or more) consecutive temp.1
    df1[, group := if (.N >= 3) .GRP, by = .(rleid(temp.1))]
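    # fill down missing group numbers (short runs inherit the previous group)
    setnafill(df1, type = "locf", cols = "group")
    # level letter = first letter of the run that opened each group
    df1[, level := toupper(substr(temp.1[1],1,1)), by = .(group)]
    # temporary run id of the corrected level, dropped at the end
    df1[, temp := rleid(level)]
    # number the cycles separately for "H" and "C", then drop the temp column
    df1[, level := paste(level, as.integer(factor(temp)), sep = "."), by = .(level)][, temp := NULL][]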
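
For comparison, the same rule can be sketched in base R without data.table. This is only an illustration, not part of the answer above: label_cycles and min_run are hypothetical names introduced here, and the sketch assumes the series starts with a run of at least min_run observations (otherwise there is no earlier state to carry forward).

```r
# Hypothetical helper: label duty cycles, ignoring runs shorter than min_run
label_cycles <- function(x, min_run = 3) {
  r <- rle(as.character(x))
  # Runs shorter than min_run are treated as noise (NA for now)
  state <- ifelse(r$lengths >= min_run, r$values, NA_character_)
  # Carry the last valid state forward over the noisy runs
  for (i in seq_along(state)) {
    if (is.na(state[i]) && i > 1) state[i] <- state[i - 1]
  }
  # Expand back to one corrected state per observation
  letter <- toupper(substr(rep(state, r$lengths), 1, 1))
  # Merge adjacent runs with the same letter and number them per letter
  cyc <- rle(letter)
  n <- ave(seq_along(cyc$values), cyc$values, FUN = seq_along)
  rep(paste(cyc$values, n, sep = "."), cyc$lengths)
}

# Usage on the first 30 rows of the sample data, written as run lengths
x <- rep(c("heating", "cooling", "heating", "cooling",
           "heating", "cooling", "heating"),
         times = c(8, 1, 6, 7, 3, 2, 3))
labels <- label_cycles(x)
unique(labels)
# → "H.1" "C.1" "H.2"

# Number of cycles per type: two heating cycles, one cooling cycle
cycle_counts <- table(substr(unique(labels), 1, 1))
```

Counting the distinct labels per letter answers the "how many cycles" part of the question. If the data also had a date column, that column would have to be added as a grouping variable, which is outside this sketch.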