I have a protein sequence of 100 amino acids (AA) that can be represented as a data.frame. Each AA has a position, and for now the position is all that matters:
Protein <- data.frame(AA = 1:100)
Then I have a data.frame of peptides from the protein (after digestion / sequence breakdown), with the initial and final AA position of each peptide relative to the protein:
df <- data.frame(
Peptides = c("Peptide_A", "Peptide_B", "Peptide_C", "Peptide_D"),
Initial.AA = c(1, 23, 59, 77),
Final.AA = c(18, 58, 70, 100)
)
Output:
   Peptides Initial.AA Final.AA
1 Peptide_A          1       18
2 Peptide_B         23       58
3 Peptide_C         59       70
4 Peptide_D         77      100
Inspecting df, it's clear that some AA were not mapped (19:22 and 71:76, a total of 10 unmapped AA).
I would like the output to be the total percentage of mapped AA, which in this example is 90% (90 AA mapped by the peptides / 100 protein AA).
All answers are welcome as always, but tidyverse ones are preferred.
A base R approach using setdiff
# \(x) is the anonymous-function shorthand available from R 4.1.0
(1 - length(setdiff(
  Protein$AA,
  unlist(apply(df[, 2:3], 1, \(x)
    seq(x["Initial.AA"], x["Final.AA"])))
)) / nrow(Protein)) * 100
[1] 90
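A possibly simpler variant of the same idea, as a sketch rather than the definitive way: build the covered positions with Map() and take the mean of a logical %in% vector. It assumes the same Protein and df objects as in the question; the names covered and pct_mapped are only for illustration.

```r
Protein <- data.frame(AA = 1:100)
df <- data.frame(
  Peptides = c("Peptide_A", "Peptide_B", "Peptide_C", "Peptide_D"),
  Initial.AA = c(1, 23, 59, 77),
  Final.AA = c(18, 58, 70, 100)
)

# All positions covered by at least one peptide;
# unique() drops duplicates in case peptides overlap
covered <- unique(unlist(Map(seq, df$Initial.AA, df$Final.AA)))

# Share of protein positions that are covered, as a percentage
pct_mapped <- mean(Protein$AA %in% covered) * 100
pct_mapped
```

Because the check is done per protein position, overlapping peptides would not be counted twice.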
A dplyr alternative may be
library(dplyr)  # reframe() requires dplyr >= 1.1.0
df %>%
  rowwise() %>%
  reframe(AA = seq(Initial.AA, Final.AA)) %>%
  summarize(total_mapped_AA = (1 - length(setdiff(Protein$AA, AA)) /
    nrow(Protein)) * 100)
# A tibble: 1 × 1
  total_mapped_AA
            <dbl>
1              90
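If it is also useful to see which positions are unmapped (19:22 and 71:76 in the question), one option is an anti_join() of the protein against the expanded peptide positions. This is only a sketch under the question's setup; unmapped and pct_mapped are illustrative names, and reframe() assumes dplyr >= 1.1.0.

```r
library(dplyr)

Protein <- data.frame(AA = 1:100)
df <- data.frame(
  Peptides = c("Peptide_A", "Peptide_B", "Peptide_C", "Peptide_D"),
  Initial.AA = c(1, 23, 59, 77),
  Final.AA = c(18, 58, 70, 100)
)

# Expand every peptide into one row per covered position
mapped <- df %>%
  rowwise() %>%
  reframe(AA = seq(Initial.AA, Final.AA))

# Keep only the protein positions that no peptide covers
unmapped <- Protein %>%
  anti_join(mapped, by = "AA")

# Percentage of mapped positions, derived from the unmapped count
pct_mapped <- (1 - nrow(unmapped) / nrow(Protein)) * 100

unmapped$AA
pct_mapped
```

The anti_join() works per protein position, so a position covered by several peptides still counts once.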