
Using rle() for indexing a data.frame - how to show zeros in the result to maintain the same vector length?


In the following example, my goal is to find the years in which the values of df (transposed into df_new) fall below a threshold of -1.2 in at least 5 consecutive instances. I then want to return the corresponding unique values from the column df_new$year as the result. My problem is that the length of the rle() result does not match the length of df_new$year, so I can't use it for indexing. The issue with rle() is that it does not return zeros, so k only contains runs of at least one value. How can I improve this bit of code to achieve what I need? Is there a way to force rle() to include zeros in k, or should I go with another approach?

# Example reproducible df:
set.seed(125)
df <- data.frame(V1=rnorm(10,-1.5,.5),
                 V2=rnorm(10,-1.5,.5),
                 V3=rnorm(10,-1.5,.5),
                 V4=rnorm(10,-1.5,.5),
                 V5=rnorm(10,-1.5,.5),
                 V6=rnorm(10,-1.5,.5),
                 V7=rnorm(10,-1.5,.5),
                 V8=rnorm(10,-1.5,.5),
                 V9=rnorm(10,-1.5,.5),
                 V10=rnorm(10,-1.5,.5))
library(data.table)
df_t <- t(df)
df_long <- melt(df_t)
df_long$year <- rep(1976:1985, each=nrow(df))
df_new <- data.frame(value=df_long$value,year=df_long$year)

# Threshold values:
threshold <- -1.2
consecutiveentries <- 5
number <- consecutiveentries - 1
# Start of the problem:
k <- rle(df_new$value < threshold)
years <- unique(df_new$year[k$lengths > number])

Current result:

> years
[1] 1976 1978 1979 1980 1982 1984 1985

What I would like it to be:

> years
[1] 1976 1980 1983 1985

Solution

  • This is ugly but it works :)

    df_new$year[cumsum(k$lengths)[which(k$lengths >= 5)-1]+1]
    

    Each part:

    idx <- which(k$lengths >= 5)-1 gives you the positions in k$lengths of the runs that come right before each run of length 5 or more.

    With cumsum(k$lengths) we then build the cumulative sum over k$lengths and take the elements at idx. The result is the number of rows that occur before the first row of each >=5 sequence.

    Adding 1 to this result gives us the indices of the rows where each sequence begins.
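
A slightly more defensive variant of the same idea, as a minimal sketch. It also checks k$values, so that only runs below the threshold count, and it computes each run's start directly, which avoids the index 0 that which(...)-1 produces (and R silently drops) when the very first run qualifies. The toy vectors value and year are made-up stand-ins for df_new$value and df_new$year, not the question's data. The second half expands the run lengths back to one entry per row with rep(), which is one possible way to get the full-length vector the question asks about.

```r
# Toy data (hypothetical values, standing in for df_new$value / df_new$year)
value <- c(-1.5, -1.3, -1.4, -1.6, -1.25,        # 5 values below threshold
           -1.0, -1.3, -1.0,                     # no long run here
           -1.4, -1.5, -1.3, -1.6, -1.7, -1.8)   # 6 values below threshold
year  <- c(rep(1976, 5), rep(1977, 3), rep(1978, 6))

threshold <- -1.2
consecutiveentries <- 5

k <- rle(value < threshold)

# Approach 1: index by the first row of each qualifying run
ends   <- cumsum(k$lengths)        # last row of every run
starts <- ends - k$lengths + 1     # first row of every run
hit    <- k$values & k$lengths >= consecutiveentries  # TRUE runs that are long enough
years_start <- unique(year[starts[hit]])

# Approach 2: expand the run information to one entry per row,
# so the mask has the same length as the data
run_len <- rep(k$lengths, k$lengths)
run_val <- rep(k$values,  k$lengths)
in_long_run <- run_val & run_len >= consecutiveentries
years_any <- unique(year[in_long_run])

print(years_start)
print(years_any)
```

The two results can differ when a run crosses a year boundary: the first reports only the year in which the run starts, while the second reports every year that the run touches.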