r dplyr data-manipulation data-cleaning

How to replace an entire row between two rows based on a column


I am dealing with a very large mRNA splicing dataset. Here is a toy dataset to exemplify the problem:

test_df <- data.frame(
  start = c(2, 9, 13, 19, 13, 20, 25, 35, 39),
  end = c(8, 12, 18, 24, 16, 24, 30, 38, 45),
  gene_id = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  exon_identity = c(NA, "Upstream", NA, "Downstream", "Event", NA, "Upstream", "Downstream", NA)
)

> test_df
  start end gene_id exon_identity
1     2   8       A          <NA>
2     9  12       A      Upstream
3    13  18       A          <NA>
4    19  24       A    Downstream
5    13  16       A         Event
6    20  24       B          <NA>
7    25  30       B      Upstream
8    35  38       B    Downstream
9    39  45       B          <NA>

For every unique value in the gene_id column, I would like to replace the row that sits between the "Upstream" and "Downstream" rows (as marked in the exon_identity column) with the "Event" row, i.e. replace row 3 with row 5. What makes this difficult for me is that some genes have no row that needs to be replaced, e.g. "B" in the gene_id column.

This question goes in the direction of previously asked questions here and here.

Based on those and other resources, I have tried:

library(tidyverse)

test_replace <- test_df %>% 
  group_by(gene_id) %>% 
  mutate(start = replace(start, row_number() > which(exon_identity == "Upstream") & row_number() < which(exon_identity == "Downstream"), start[exon_identity == "Event"]),
         end = replace(end, row_number() > which(exon_identity == "Upstream") & row_number() < which(exon_identity == "Downstream"), end[exon_identity == "Event"]),
         exon_identity = replace(exon_identity, row_number() > which(exon_identity == "Upstream") & row_number() < which(exon_identity == "Downstream"), "Event")
         )


Warning message:
There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `start = replace(...)`.
ℹ In group 1: `gene_id = "A"`.
Caused by warning in `x[list] <- values`:
! number of items to replace is not a multiple of replacement length
ℹ Run dplyr::last_dplyr_warnings() to see the 1 remaining warning. 
> 
> test_replace
# A tibble: 9 × 4
# Groups:   gene_id [2]
  start   end gene_id exon_identity
  <dbl> <dbl> <chr>   <chr>        
1     2     8 A       NA           
2     9    12 A       Upstream     
3    NA    NA A       Event        
4    19    24 A       Downstream   
5    13    16 A       Event        
6    20    24 B       NA           
7    25    30 B       Upstream     
8    35    38 B       Downstream   
9    39    45 B       NA     

Desired output:


> desired_outcome 
  start end gene_id exon_identity
1     2   8       A          <NA>
2     9  12       A      Upstream
3    13  16       A         Event
4    19  24       A    Downstream
5    20  24       B          <NA>
6    25  30       B      Upstream
7    35  38       B    Downstream
8    39  45       B          <NA>

A solution, preferably using the tidyverse, would be highly appreciated.

Thank you!


Solution
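
A side note on the warning in the question: the NA values in row 3 appear to come from how R handles NA in logical subsetting, rather than from the grouping. The sketch below uses only base R and the gene "A" values from the toy data; the variable name is_event is introduced here for illustration.

```r
# Gene "A" from the toy data
exon_identity <- c(NA, "Upstream", NA, "Downstream", "Event")
end <- c(8, 12, 18, 24, 16)

# Comparing NA with a string gives NA, not FALSE
is_event <- exon_identity == "Event"
is_event
# NA FALSE NA FALSE TRUE

# Subsetting with a logical vector that contains NA returns one NA per NA index,
# so three values come back instead of one
end[is_event]
# NA NA 16

# which() drops the NA positions, so exactly one value comes back
end[which(is_event)]
# 16

# With a single replacement value, replace() fills row 3 as intended
fixed_end <- replace(end, 3, end[which(is_event)])
fixed_end
# 8 12 16 24 16
```

In the original attempt, replace() receives three candidate values for a single slot, uses the first one (NA), and warns that the lengths do not match.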

  • In the toy example, reordering your data set gives you almost all of what you want. Will that work in the real data set? E.g.

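    library(tidyverse)

    test_df |>
      # Group first so lag()/lead() never look across gene boundaries
      group_by(gene_id) |>
      mutate(
        # TRUE for a row directly between an "Upstream" and a "Downstream" row
        sandwich = lag(exon_identity == "Upstream") & lead(exon_identity == "Downstream")
      ) |>
      # Comparisons with NA yield NA; treat those rows as not sandwiched
      replace_na(list(sandwich = FALSE)) |>
      # Sort by start within each gene so the "Event" row moves into position
      arrange(start, .by_group = TRUE) |>
      ungroup() |>
      # Drop the row being replaced, then the helper column
      filter(!sandwich) |>
      select(-sandwich)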
    

    (In the toy example, group_by() and ungroup() are not needed; I added them in case they are useful in the real data set. Note that arrange() ignores grouping unless .by_group = TRUE.)
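
If sorting by start is not reliable in the real data, one alternative is to move the "Event" row into place by position within each gene. The following is a minimal sketch, not part of the original answer: the helper swap_in_event() is a hypothetical name, and it assumes each gene has at most one "Upstream", one "Downstream", and one "Event" row, with exactly one row to replace between the flanking exons. Genes that do not meet those assumptions are returned unchanged.

```r
library(dplyr)

test_df <- data.frame(
  start = c(2, 9, 13, 19, 13, 20, 25, 35, 39),
  end = c(8, 12, 18, 24, 16, 24, 30, 38, 45),
  gene_id = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  exon_identity = c(NA, "Upstream", NA, "Downstream", "Event", NA, "Upstream", "Downstream", NA)
)

# Hypothetical helper: reorder one gene's rows so that the "Event" row takes
# the place of the single row between the "Upstream" and "Downstream" rows.
swap_in_event <- function(df) {
  up    <- match("Upstream", df$exon_identity)
  down  <- match("Downstream", df$exon_identity)
  event <- match("Event", df$exon_identity)
  # Leave the gene untouched if any of the three labels is missing
  if (anyNA(c(up, down, event))) return(df)

  idx <- seq_len(nrow(df))
  between <- setdiff(idx[idx > up & idx < down], event)
  # Assumption: exactly one row to replace; otherwise leave the gene untouched
  if (length(between) != 1) return(df)

  idx[between] <- event            # the in-between slot now points at the Event row
  df[idx[-event], , drop = FALSE]  # drop the Event row's original slot
}

result <- test_df %>%
  group_by(gene_id) %>%
  group_modify(~ swap_in_event(.x)) %>%
  ungroup() %>%
  select(start, end, gene_id, exon_identity)  # restore the original column order

result
```

On the toy data this yields the eight rows of the desired output. The select() call at the end is only there because group_modify() puts the grouping column first.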