Search code examples
rrun-length-encoding

In R: How to subset a large dataframe by top 5 longest runs of frequent values in 1 column?


I have a dataframe with 1 column. The values in this column can ONLY be "good" or "bad". I would like to find the top 5 largest runs of "bad".

I am able to use the rle(df) function to get the running length of all the "good" and "bad".

  1. How do i find the 5 largest runs that attribute to ONLY "bad"?
  2. How do i get the starting and ending indices of the top 5 largest runs for ONLY "bad"?

Your assistance is much appreciated!


Solution

  • One option would be rleid. Convert the 'data.frame' to 'data.table' (setDT(df1)), creating grouping column with rleid (generates a unique id based on adjacent non-matching elements, create the number of elements per group (n) as a column, and row number also as another column ('rn'), subset the rows where 'goodbad' is "bad", order 'n' in decreasing order, grouped by 'grp', summarise the 'first' and 'last' row numbe, as well as the entry for goodbad

    library(data.table)
    setDT(df1)[, grp := rleid(goodbad)][,  n := .N, grp][ ,
       rn := .I][goodbad == 'bad'][order(-n), .(goodbad = first(goodbad), 
        n = n, start = rn[1], last = rn[.N]), .(grp)
        ][n %in% head(unique(n), 5)][, grp := NULL][]
    

    Or we can use rle and other base R methods

    rl <- rle(df1$goodbad)
    grp <- with(rl, rep(seq_along(values), lengths))
    df2 <- transform(df1, grp = grp, n = rep(rl$lengths, rl$lengths),
                rn = seq_len(nrow(df1)))
    df3 <- subset(df2, goodbad == 'bad')
    do.call(data.frame, aggregate(rn ~ grp, subset(df3[order(-df3$n),], 
          n %in% head(unique(n), 5)), range))
    

    data

    set.seed(24)
    df1 <- data.frame(goodbad = sample(c("good", "bad"), 100,
              replace = TRUE), stringsAsFactors = FALSE)