In R: How to subset a large dataframe by top 5 longest runs of frequent values in 1 column?

I have a dataframe with 1 column. The values in this column can ONLY be "good" or "bad". I would like to find the top 5 largest runs of "bad".

I am able to use the rle(df) function to get the running length of all the "good" and "bad".

How do i find the 5 largest runs that attribute to ONLY "bad"?
How do i get the starting and ending indices of the top 5 largest runs for ONLY "bad"?

Your assistance is much appreciated!

Solution

One option would be rleid. Convert the 'data.frame' to 'data.table' (setDT(df1)), creating grouping column with rleid (generates a unique id based on adjacent non-matching elements, create the number of elements per group (n) as a column, and row number also as another column ('rn'), subset the rows where 'goodbad' is "bad", order 'n' in decreasing order, grouped by 'grp', summarise the 'first' and 'last' row numbe, as well as the entry for goodbad

library(data.table)
setDT(df1)[, grp := rleid(goodbad)][,  n := .N, grp][ ,
   rn := .I][goodbad == 'bad'][order(-n), .(goodbad = first(goodbad), 
    n = n, start = rn[1], last = rn[.N]), .(grp)
    ][n %in% head(unique(n), 5)][, grp := NULL][]

Or we can use rle and other base R methods

rl <- rle(df1$goodbad)
grp <- with(rl, rep(seq_along(values), lengths))
df2 <- transform(df1, grp = grp, n = rep(rl$lengths, rl$lengths),
            rn = seq_len(nrow(df1)))
df3 <- subset(df2, goodbad == 'bad')
do.call(data.frame, aggregate(rn ~ grp, subset(df3[order(-df3$n),], 
      n %in% head(unique(n), 5)), range))

data

set.seed(24)
df1 <- data.frame(goodbad = sample(c("good", "bad"), 100,
          replace = TRUE), stringsAsFactors = FALSE)