Search code examples
rdplyrtidyversesequence

Sequence coverage using R


I have a protein sequence with 100 aminoacids (AA) that can be handled as a data.frame. Each AA has a position and for now all that matters is the position:

Protein <- data.frame(AA = 1:100)

Than I have a data.frame with peptides from the protein (after digestion / sequence breakdown) with Initial and Final position of the AA related to the protein:

df <- data.frame(
Peptides = c("Peptide_A", "Peptide_B", "Peptide_C", "Peptide_D"), 
Initial.AA = c(1, 23, 59, 77), 
Final.AA = c(18, 58, 70, 100)
)

Output:

   Peptides Initial.AA Final.AA
1 Peptide_A          1       18
2 Peptide_B         23       58
3 Peptide_C         59       70
4 Peptide_D         77      100

Inspecting df it´s clear that some AA were not mapped (19:22 and 71:76, total of 10 unmapped AA).

I would like the have as output the total percentual of mapped AA, which in this example is 90% (90 mapped AA from all the peptides / 100 protein AA).

All answers are welcome as always, but tidyverse ones are prefered.


Solution

  • A base R approach using setdiff

    (1 - length(setdiff(
           Protein$AA, 
           unlist(apply(df[,2:3], 1, \(x) 
                    seq(x["Initial.AA"], x["Final.AA"]))))) / nrow(Protein)) * 100
    [1] 90
    

    A dplyr alternative may be

    library(dplyr)
    
    df %>% 
      rowwise() %>% 
      reframe(AA = seq(Initial.AA, Final.AA)) %>% 
      summarize(total_mapped_AA = (1 - length(setdiff(Protein$AA, AA)) / 
        nrow(Protein)) * 100)
    # A tibble: 1 × 1
      total_mapped_AA
                <dbl>
    1              90