I am dealing with a very large mRNA splicing dataset. Here is a toy dataset to exemplify the problem:
test_df <- data.frame(
  start = c(2, 9, 13, 19, 13, 20, 25, 35, 39),
  end = c(8, 12, 18, 24, 16, 24, 30, 38, 45),
  gene_id = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  exon_identity = c(NA, "Upstream", NA, "Downstream", "Event", NA, "Upstream", "Downstream", NA)
)
> test_df
  start end gene_id exon_identity
1     2   8       A          <NA>
2     9  12       A      Upstream
3    13  18       A          <NA>
4    19  24       A    Downstream
5    13  16       A         Event
6    20  24       B          <NA>
7    25  30       B      Upstream
8    35  38       B    Downstream
9    39  45       B          <NA>
For every unique value in the gene_id column, I would like to replace any row that sits between the "Upstream" and "Downstream" rows of the exon_identity column with that gene's "Event" row, i.e. replace row 3 with row 5. What makes it difficult for me is that some genes have no row that needs to be replaced, e.g. "B" in the gene_id column.
This question goes in the direction of previously asked questions here and here.
Based on those and other resources, I have tried:
library(tidyverse)
test_replace <- test_df %>%
  group_by(gene_id) %>%
  mutate(
    start = replace(start, row_number() > which(exon_identity == "Upstream") & row_number() < which(exon_identity == "Downstream"), start[exon_identity == "Event"]),
    end = replace(end, row_number() > which(exon_identity == "Upstream") & row_number() < which(exon_identity == "Downstream"), end[exon_identity == "Event"]),
    exon_identity = replace(exon_identity, row_number() > which(exon_identity == "Upstream") & row_number() < which(exon_identity == "Downstream"), "Event")
  )
Warning message:
There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `start = replace(...)`.
ℹ In group 1: `gene_id = "A"`.
Caused by warning in `x[list] <- values`:
! number of items to replace is not a multiple of replacement length
ℹ Run dplyr::last_dplyr_warnings() to see the 1 remaining warning.
> test_replace
# A tibble: 9 × 4
# Groups:   gene_id [2]
  start   end gene_id exon_identity
  <dbl> <dbl> <chr>   <chr>
1     2     8 A       NA
2     9    12 A       Upstream
3    NA    NA A       Event
4    19    24 A       Downstream
5    13    16 A       Event
6    20    24 B       NA
7    25    30 B       Upstream
8    35    38 B       Downstream
9    39    45 B       NA
Desired output:
> desired_outcome
  start end gene_id exon_identity
1     2   8       A          <NA>
2     9  12       A      Upstream
3    13  16       A         Event
4    19  24       A    Downstream
5    20  24       B          <NA>
6    25  30       B      Upstream
7    35  38       B    Downstream
8    39  45       B          <NA>
A solution, preferably using the tidyverse, would be highly appreciated.
Thank you!
In the toy example, reordering your data set gives you almost all of what you want. Will that work in the real data set? E.g.
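library(tidyverse)
test_df |>
  mutate(
    # TRUE for a row whose previous row is "Upstream" and next row is "Downstream"
    sandwich = lag(exon_identity == 'Upstream') & lead(exon_identity == 'Downstream')
  ) |>
  # NA comparisons (first/last rows, NA labels) count as "not sandwiched"
  replace_na(list(sandwich = FALSE)) |>
  group_by(gene_id) |>
  # .by_group = TRUE makes arrange() sort within each gene; without it the grouping is ignored
  arrange(start, .by_group = TRUE) |>
  ungroup() |>
  # drop the sandwiched row; the "Event" row has been sorted into its place
  filter(!sandwich) |>
  select(-sandwich)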
(In the toy example, group_by and ungroup are not needed. I added them in case they are needed/useful in the real data set.)
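If the real data are not already ordered so that the row to drop sits directly between the "Upstream" and "Downstream" rows, a coordinate-based variant might be more robust. The following is only a sketch, not the approach above: it assumes that "between" can be read as "start and end lie between the end of the Upstream exon and the start of the Downstream exon", that each gene has at most one "Upstream" and one "Downstream" row, and that only unlabelled (NA) rows should be dropped. The helper columns up_end, down_start, has_event and sandwiched are names made up for this illustration.

```r
library(dplyr)

test_df <- data.frame(
  start = c(2, 9, 13, 19, 13, 20, 25, 35, 39),
  end = c(8, 12, 18, 24, 16, 24, 30, 38, 45),
  gene_id = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  exon_identity = c(NA, "Upstream", NA, "Downstream", "Event",
                    NA, "Upstream", "Downstream", NA)
)

result <- test_df |>
  group_by(gene_id) |>
  mutate(
    # Coordinates of the flanking exons in this gene
    # (assumption: at most one of each; NA if the gene has none)
    up_end = end[which(exon_identity == "Upstream")[1]],
    down_start = start[which(exon_identity == "Downstream")[1]],
    # Does this gene have an "Event" row to put in place of the dropped one?
    has_event = any(exon_identity == "Event", na.rm = TRUE),
    # Unlabelled rows lying entirely between the two flanking exons
    sandwiched = is.na(exon_identity) & start > up_end & end < down_start
  ) |>
  # Only drop rows in genes that have an "Event" row; treat NA as "keep"
  filter(!(has_event & coalesce(sandwiched, FALSE))) |>
  # Sort within each gene so the "Event" row lands where the dropped row was
  arrange(start, .by_group = TRUE) |>
  ungroup() |>
  select(start, end, gene_id, exon_identity)

print(result)
```

On the toy data this drops row 3 and leaves gene "B" untouched, because "B" has no "Event" row. Whether the coordinate test is the right definition of "between" for the real splicing data is something you would need to check.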