I am working on a big dataframe which could be represented by the following example:
chromosome position position2 name Occup
Chr1 1 1 - 0.023
Chr1 2 2 - 0.023
Chr1 3 3 - 0.023
Chr1 4 4 - 0.023
Chr1 5 5 - 0.023
Chr1 6 6 - 0.069
Chr1 7 7 - 0.069
Chr1 8 8 - 0.069
Chr1 9 9 - 0.069
Chr1 10 10 - 0.116
Chr1 11 11 - 0.116
Chr1 12 12 - 0.116
Chr1 13 13 - 0.023
Chr1 14 14 - 0.023
Chr1 15 15 - 0.023
Chr1 16 16 - 0.023
Chr1 17 17 - 0.023
You can read it in as:
dtf = data.frame(chromosome=c("Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1"),
position=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17),
position2=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17),
name=c("-","-","-","-","-","-","-","-","-","-","-","-","-","-","-","-","-"),
Occup=c(0.023,0.023,0.023,0.023,0.023,0.069,0.069,0.069,0.069,0.116,0.116,0.116,0.023,0.023,0.023,0.023,0.023))
I want to collapse it to a dataframe like this:
chromosome position position2 name Occup
Chr1 1 5 - 0.023
Chr1 6 9 - 0.069
Chr1 10 12 - 0.116
Chr1 13 17 - 0.023
The problem with basic collapsing would be that Occup values are placed together in 1 group.. Which is not what I want. I want them to be clustered within a group untill the next row changes.
If I do:
library(plyr)
test<-ddply(dtf, .(Occup), summarise,
position_start=min(position),
position_end= max(position2))
I get
Occup position_start position_end
0.023 1 17
0.069 6 9
0.116 10 12
So it is close to what I want but not what I want.
There is no need to take column 1 or 3 into account as these columns are arbitrary in this case and contain the same information over all rows.
This should work:
library(dplyr)
dtf_grouped <- dtf %>%
arrange(position) %>% # to ensure data is sequential
mutate(
occup_shift = Occup - lag(Occup, 1) != 0, # flag row change
occup_shift = ifelse(is.na(occup_shift), FALSE, occup_shift), # replace NA's
group_id = cumsum(occup_shift)
) %>%
group_by(group_id) %>%
summarize(
Occup = min(Occup),
position_start = position[1],
position_end = position2[n()]
) %>%
select(-group_id)
head(dtf_grouped)
# A tibble: 4 x 3
Occup position_start position_end
<dbl> <dbl> <dbl>
1 0.0230 1 5
2 0.0690 6 9
3 0.116 10 12
4 0.0230 13 17