I would like to concatenate individual genomic intervals into common regions.
My input:
dfin <- "chr start end sample type
1 10 20 NE1 loss
1 5 15 NE2 gain
1 25 30 NE1 gain
2 40 50 NE1 loss
2 40 60 NE2 loss
3 20 30 NE1 gain"
dfin <- read.table(text=dfin, header=T)
My expected output:
dfout <- "chr start end samples type
1 5 20 NE1-NE2 both
1 25 30 NE1 gain
2 40 60 NE1-NE2 loss
3 20 30 NE1 gain"
dfout <- read.table(text=dfout, header=T)
The intervals in dfin
will never overlap in the same animal, just between animals (columns sample
and samples
, respectively). The column type
have two factors (loss
and gain
) in dfin
and is expected to have three factors in dfout
(loss
, gain
and both
, which occur when the concatenated region in dfout
was based on both loss
and gain
).
Any idea to deal with that?
*Updated for @David Arenburg
(expanding on the comment) You could use the "IRanges" package:
library(IRanges)
#build an appropriate object
dat = RangedData(space = dfin$chr,
IRanges(dfin$start, dfin$end),
sample = dfin$sample,
type = dfin$type)
dat
#concatenate overlaps with an extra step of saving the concatenation mappings
ans = RangedData(reduce(ranges(dat), with.revmap = TRUE))
ans
Could not figure out how to avoid reduce
losing columns of "RangedData" object, but having the mappings saved we could do something like (there might be a more appropriate -according to "IRanges"- way to extract mappings, but I was not able to find it):
tmp = elementMetadata(ranges(ans)@unlistData)$revmap@partitioning
maps = rep(seq_along(start(tmp)), width(tmp))
maps
#[1] 1 1 2 3 3 4
Having the mappings of interval concatenation, we could aggregate "sample" and "type" to get the final form. E.g.:
tapply(dfin$sample, maps, function(X) paste(unique(X), collapse = "-"))
# 1 2 3 4
#"NE1-NE2" "NE1" "NE1-NE2" "NE1"