r dataframe aggregate bioinformatics plyr

Collapsing a basic dataframe based on adjecent rows

I am working on a big dataframe which could be represented by the following example:

chromosome  position    position2   name    Occup       
Chr1    1   1   -   0.023
Chr1    2   2   -   0.023
Chr1    3   3   -   0.023
Chr1    4   4   -   0.023
Chr1    5   5   -   0.023
Chr1    6   6   -   0.069
Chr1    7   7   -   0.069
Chr1    8   8   -   0.069
Chr1    9   9   -   0.069
Chr1    10  10  -   0.116
Chr1    11  11  -   0.116
Chr1    12  12  -   0.116
Chr1    13  13  -   0.023
Chr1    14  14  -   0.023
Chr1    15  15  -   0.023
Chr1    16  16  -   0.023
Chr1    17  17  -   0.023

You can read it in as:

dtf = data.frame(chromosome=c("Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1"), 
                position=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17), 
                position2=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17),        
                name=c("-","-","-","-","-","-","-","-","-","-","-","-","-","-","-","-","-"), 
                Occup=c(0.023,0.023,0.023,0.023,0.023,0.069,0.069,0.069,0.069,0.116,0.116,0.116,0.023,0.023,0.023,0.023,0.023))

I want to collapse it to a dataframe like this:

chromosome  position    position2   name    Occup       
Chr1    1   5   -   0.023
Chr1    6   9   -   0.069
Chr1    10  12  -   0.116
Chr1    13  17  -   0.023

The problem with basic collapsing would be that Occup values are placed together in 1 group.. Which is not what I want. I want them to be clustered within a group untill the next row changes.

If I do:

library(plyr)
test<-ddply(dtf, .(Occup), summarise,
      position_start=min(position),
      position_end= max(position2))

I get

Occup   position_start  position_end    
0.023   1   17
0.069   6   9
0.116   10  12

So it is close to what I want but not what I want.

There is no need to take column 1 or 3 into account as these columns are arbitrary in this case and contain the same information over all rows.

Solution

This should work:

library(dplyr)

dtf_grouped <- dtf %>%
    arrange(position) %>% # to ensure data is sequential
    mutate(
        occup_shift = Occup - lag(Occup, 1) != 0, # flag row change
        occup_shift = ifelse(is.na(occup_shift), FALSE, occup_shift), # replace NA's
        group_id = cumsum(occup_shift)
        ) %>%
    group_by(group_id) %>%
    summarize(
        Occup = min(Occup),
        position_start = position[1],
        position_end = position2[n()]
    ) %>%
    select(-group_id)

head(dtf_grouped)

# A tibble: 4 x 3
   Occup position_start position_end
   <dbl>          <dbl>        <dbl>
1 0.0230              1            5
2 0.0690              6            9
3 0.116              10           12
4 0.0230             13           17