I have a list of data frames that each have year, month, day, and temperature columns along with some other stuff. I have already figured out (with stack overflow help thanks guys) how to subset the data frames so that they only contain the data that has the temperature above a certain threshold. However, now I need to take those subsets and find another subset of that of the days that are consecutive (basically finding all the rows where the temperature was above the threshold for 3+ days).
example data:
set.seed(1234)
A <- data.frame("D" = c(sample(1:5, size = 1000, replace = TRUE)), "Temp" = c(sample(0:35, size = 1000, replace = TRUE)))
B <- data.frame("D" = c(sample(1:5, size = 1000, replace = TRUE)), "Temp" = c(sample(-10:22, size = 1000, replace = TRUE)))
C <- data.frame("D" = c(sample(1:5, size = 1000, replace = TRUE)), "Temp" = c(sample(3:42, size = 1000, replace = TRUE)))
climate <- list("Alist" = A, "Blist" = B, "Clist" = C)
# courtesy of ThomasIsCoding
tm <- lapply(
climate,
function(x) {
subset(
x,
Temp > quantile(Temp, probs = 0.90)
)
}
)
tm
So far I've tried using lapply, diff, and rle to find the instances where days were consecutive. I took my list of dataframes, used lapply to select the days column, then used lapply again to find the difference between each element in the days column, and then used lapply again to do rle.
result <- lapply(lapply(lapply(
tm, '[[',1), # this selects the days column in all the dataframes in the list tmax95
diff), # this finds the difference between each element in the days column to help determine if they are consecutive
rle) # i don't really understand rle fully but to my understanding this takes all the differences and counts how many there are in order or something like that
result
Then I use another lapply to find the instances where there are more than 2 differences of 1 day (consecutive)
fin <- lapply(result, function(x) x$lengths>=2 & x$values==1)
fin
However, this isn't quite what I want as an end result as this returns a list of lists of Trues and Falses, when what I'd like is the rows of data associated with the Trues that are being returned. How would I be able to get my output to be the data with consecutive days instead of the Trues?
You can try the following code, which first applies the Temp threshold to each climate using subset
(as you did), and then subsets these using diff
to find the rows that have 3 consecutive days.
result <- lapply(climate, \(x) {
y <- subset(x,
Temp > quantile(Temp, probs = 0.90)
)
y[unique(
sort(
unlist(
lapply(
which(
c(diff(y[,'D'])==1, FALSE) & c(diff(y[,'D'], diff=2)==0, FALSE, FALSE)),
\(x) x + 0:2)
))),]
}
)
lapply(result, head)
$Alist
D Temp
317 2 35
326 3 33
341 4 34
579 3 34
582 4 33
584 5 35
$Blist
D Temp
357 2 21
358 3 22
361 4 20
$Clist
D Temp
257 2 40
258 3 40
269 4 41
350 1 41
351 2 42
364 3 41
Explanation:
diff()
takes differences of each element. Setting this to == 1 finds successive days. diff(..., diff=2)==0
finds successive differences that are equal. Together, these two conditions find the first of 3 successive days. Adding FALSE
ensures the vector has the correct length since differencing reduces the length by 1 each time.
which
finds the indices of the first successive 3 days. +0:2
gives the indices for the 3 consecutive days.
lapply
does the loop for each day and we need to unlist
that to get a vector, sort it, and then remove duplicates with unique
.
Data (climate
) provided in question with set.seed(1234)
.