Search code examples
rbioinformaticsbioconductor

Identifying amino acid substitutions from local alignments in R


I want to identify the position and amino acid change in a specific region of interest in my sequences, and store that information in a table.

Is it possible to do something like this in R, by maybe using the bioconductor package? I have managed to do a simple alignment of the sequences with AlignSeqs() from the DECIPHER package, but I am unable to pull out the differences in the sequences automatically. I start with FASTA files.

I want to end up with something like this:

Isolate ID    Reference_AA   Sample_AA   Pos
1             S              T           254
2             T              D           200
3             L              A           230

I have a reference sequence which is 74 AA long, and I want to look at differences in the query sequences (which are much longer than the reference) compared to the reference, and list the results in a table. The position column is related to the position in the reference sequence and not in the query sequence. I want the first AA in the reference sequence to start at 68, not 1.

I find it difficult to add example sequences for this as they tend to be very long, but here is something shorter to work with (not related to the table above):

>ref
VGRALPDVR

>query1
KSSYLDYAMSVIVGTALPDVRDGLKPVHRRVLY

>query2
ELKSSYLDYAMSVIVGRAAPDVRDGLKPV

Expected output:

ID       Reference_AA    Sample_AA   Pos
query1   R               T           70
query2   A               L           72

Solution

  • Since you want the position in the reference, you can just use a series of pair-wise alignments to the reference sequence. biostrings includes a mismatchTable function that will give you data frame with all the info you need. With a bit of reformatting using dplyr:

    library(Biostrings)
    library(dplyr)
    
    seqs<-readAAStringSet("test.fa")
    
    mismatches <- function(query, ref) {
        pairwiseAlignment(ref, query, substitutionMatrix = "BLOSUM50",
                          gapOpening = 3, gapExtension = 1) %>%
          mismatchTable() %>%
          mutate(ID=names(query), 
                 Pos=PatternStart+67, 
                 Reference_AA=as.character(PatternSubstring),
                 Sample_AA=as.character(SubjectSubstring)) %>% 
                 select(ID, Reference_AA, Sample_AA, Pos)
    }  
    
    bind_rows(mismatches(seqs[2], seqs[1]), mismatches(seqs[2], seqs[1]))
    
    #>      ID Reference_AA Sample_AA Pos
    #>1 query1            R         T  70
    #>2 query2            L         A  72
    

    EDIT

    Here's how to loop over inputs using lapply:

    bind_rows(lapply(seq_along(seqs[-1]), function(i) mismatches(seqs[i+1], seqs[1])))
    #>   ID Reference_AA Sample_AA Pos
    #>1 ref            R         T  70
    #>2 ref            L         A  72