I would very much appreciate your help on this one. I'm trying to condense a data frame of 200,000+ rows in which the integer in column "end" of one row is exactly equal to the integer in column "start" of the next consecutive row. For reference, these are chromosomal base pair positions; example code is below:
genomic_ranges <- data.frame(sample_ID = c("A", "B", "B", "B", "C"),
                             start = c(1, 20, 30, 40, 250),
                             end = c(5, 30, 40, 70, 400),
                             feature = c("normal", "DUP", "DUP", "DUP", "DUP"))
  sample_ID start end feature
1         A     1   5  normal
2         B    20  30     DUP
3         B    30  40     DUP
4         B    40  70     DUP
5         C   250 400     DUP
I have tried logical vectors, boolean operators, ifelse statements, for loops, etc., but I can't find a way to 1) delete the rows showing the middle ranges, and 2) paste together the first and last rows, which contain the true start and end positions of the range.
Some of what I've tried:
ifelse(cnv_catalogue_final$end == cnv_catalogue_final$start, "to_delete", "other")
cnv_catalogue_final$end %in% cnv_catalogue_final$start
dplyr::filter(slice_min(start, x) | slice_max(end, x))
Even if I use an overlap test such as (StartA <= EndB) and (EndA >= StartB), I'll still be losing either the start or the end position.
*Edit: thank you all for your feedback! I've updated the question with code. These rows do have IDs, given in sample_ID. Ideally, I would like one row with the complete range of 20-70, instead of it being cut into segments of 20-30, 30-40, and 40-70 across three rows with the same sample_ID identifier.
There are several ways to achieve this; here is one. It assumes each sample_ID holds a single contiguous run of segments, as in your example:
library(tidyverse)
genomic_ranges %>%
  group_by(sample_ID) %>%
  summarize(start = min(start),
            end = max(end),
            feature = feature[1])
which gives:
# A tibble: 3 x 4
  sample_ID start   end feature
  <chr>     <dbl> <dbl> <chr>
1 A             1     5 normal
2 B            20    70 DUP
3 C           250   400 DUP
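If a sample can contain more than one separate run of segments (which may well happen in 200,000+ rows), taking min(start) and max(end) per sample_ID would also bridge the gaps between runs. Below is a minimal sketch of one possible variation, not the only approach: it starts a new run whenever a row's start differs from the previous row's end within the same sample_ID and feature. The extra B row (100-120) and the helper column name `run` are hypothetical, added only to illustrate the gap case, and treating a change of feature as a break is my assumption.

```r
library(dplyr)

# Example data from the question, plus one hypothetical extra row for
# sample B (100-120) that is NOT contiguous with the 20-70 run.
genomic_ranges <- data.frame(
  sample_ID = c("A", "B", "B", "B", "B", "C"),
  start     = c(1, 20, 30, 40, 100, 250),
  end       = c(5, 30, 40, 70, 120, 400),
  feature   = c("normal", "DUP", "DUP", "DUP", "DUP", "DUP")
)

merged <- genomic_ranges %>%
  arrange(sample_ID, start) %>%
  group_by(sample_ID, feature) %>%
  # `run` increases by one each time a row's start does not equal the
  # previous row's end, so every contiguous chain gets its own id.
  mutate(run = cumsum(c(TRUE, start[-1] != end[-n()]))) %>%
  group_by(sample_ID, feature, run) %>%
  summarise(start = min(start), end = max(end), .groups = "drop") %>%
  select(sample_ID, start, end, feature)

print(merged)
```

With this input, sample B should come out as two rows (20-70 and 100-120) rather than a single 20-120 row. If segments should also be merged when they overlap rather than only when they touch exactly, the `!=` comparison would need to be adapted.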