Sequence coverage using R

I have a protein sequence with 100 aminoacids (AA) that can be handled as a data.frame. Each AA has a position and for now all that matters is the position:

Protein <- data.frame(AA = 1:100)

Than I have a data.frame with peptides from the protein (after digestion / sequence breakdown) with Initial and Final position of the AA related to the protein:

df <- data.frame(
Peptides = c("Peptide_A", "Peptide_B", "Peptide_C", "Peptide_D"), 
Initial.AA = c(1, 23, 59, 77), 
Final.AA = c(18, 58, 70, 100)
)

Output:

   Peptides Initial.AA Final.AA
1 Peptide_A          1       18
2 Peptide_B         23       58
3 Peptide_C         59       70
4 Peptide_D         77      100

Inspecting df it´s clear that some AA were not mapped (19:22 and 71:76, total of 10 unmapped AA).

I would like the have as output the total percentual of mapped AA, which in this example is 90% (90 mapped AA from all the peptides / 100 protein AA).

All answers are welcome as always, but tidyverse ones are prefered.

Solution

A base R approach using setdiff

(1 - length(setdiff(
       Protein$AA, 
       unlist(apply(df[,2:3], 1, \(x) 
                seq(x["Initial.AA"], x["Final.AA"]))))) / nrow(Protein)) * 100
[1] 90

A dplyr alternative may be

library(dplyr)

df %>% 
  rowwise() %>% 
  reframe(AA = seq(Initial.AA, Final.AA)) %>% 
  summarize(total_mapped_AA = (1 - length(setdiff(Protein$AA, AA)) / 
    nrow(Protein)) * 100)
# A tibble: 1 × 1
  total_mapped_AA
            <dbl>
1              90