Search code examples
rdataframeaggregatebioinformaticsplyr

Collapsing a basic dataframe based on adjecent rows


I am working on a big dataframe which could be represented by the following example:

chromosome  position    position2   name    Occup       
Chr1    1   1   -   0.023
Chr1    2   2   -   0.023
Chr1    3   3   -   0.023
Chr1    4   4   -   0.023
Chr1    5   5   -   0.023
Chr1    6   6   -   0.069
Chr1    7   7   -   0.069
Chr1    8   8   -   0.069
Chr1    9   9   -   0.069
Chr1    10  10  -   0.116
Chr1    11  11  -   0.116
Chr1    12  12  -   0.116
Chr1    13  13  -   0.023
Chr1    14  14  -   0.023
Chr1    15  15  -   0.023
Chr1    16  16  -   0.023
Chr1    17  17  -   0.023

You can read it in as:

dtf = data.frame(chromosome=c("Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1"), 
                position=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17), 
                position2=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17),        
                name=c("-","-","-","-","-","-","-","-","-","-","-","-","-","-","-","-","-"), 
                Occup=c(0.023,0.023,0.023,0.023,0.023,0.069,0.069,0.069,0.069,0.116,0.116,0.116,0.023,0.023,0.023,0.023,0.023))

I want to collapse it to a dataframe like this:

chromosome  position    position2   name    Occup       
Chr1    1   5   -   0.023
Chr1    6   9   -   0.069
Chr1    10  12  -   0.116
Chr1    13  17  -   0.023

The problem with basic collapsing would be that Occup values are placed together in 1 group.. Which is not what I want. I want them to be clustered within a group untill the next row changes.

If I do:

library(plyr)
test<-ddply(dtf, .(Occup), summarise,
      position_start=min(position),
      position_end= max(position2))

I get

Occup   position_start  position_end    
0.023   1   17
0.069   6   9
0.116   10  12

So it is close to what I want but not what I want.

There is no need to take column 1 or 3 into account as these columns are arbitrary in this case and contain the same information over all rows.


Solution

  • This should work:

    library(dplyr)
    
    dtf_grouped <- dtf %>%
        arrange(position) %>% # to ensure data is sequential
        mutate(
            occup_shift = Occup - lag(Occup, 1) != 0, # flag row change
            occup_shift = ifelse(is.na(occup_shift), FALSE, occup_shift), # replace NA's
            group_id = cumsum(occup_shift)
            ) %>%
        group_by(group_id) %>%
        summarize(
            Occup = min(Occup),
            position_start = position[1],
            position_end = position2[n()]
        ) %>%
        select(-group_id)
    
    head(dtf_grouped)
    
    # A tibble: 4 x 3
       Occup position_start position_end
       <dbl>          <dbl>        <dbl>
    1 0.0230              1            5
    2 0.0690              6            9
    3 0.116              10           12
    4 0.0230             13           17