Search code examples
rbioinformaticsgeneticssequence-alignment

How to convert from seqinr SeqFastadna object to Biostrings DNAStringSet for multiple sequence alignment in R


I am working with DNA sequence data in fasta files, and have to work only in R for this project. I do some manipulations using the seqinr package (selecting a subset of sequences, altering the fasta headers etc). For the next stage in the analysis I want to do a multiple sequence alignment, and have used the msa R package. I can get msa working if I import a fasta file, but I'm struggling to move directly within R from the seqinr list object to the Biostrings DNAStringSet object that I have used as input for msa.

Example data - Assume that fasta_file.fasta is a fasta file with contents as follows:

>seq1
ATATATAT
>seq2
CGCGCGCG
>seq3
ATATCGCG
>seq4
ATATATAT

The code I've used is:

# Load packages
library(tidyverse)
library(seqinr)
library(msa)
library(adegenet)
library(bios2mds)
library(ape)
library(ggtree)

# Import sequences (using seqinr)
sequences <- read.fasta("fasta_file.fasta")

# Define sequences for selection
seqs_select <- c("seq1", "seq2", "seq3")

# Select chosen sequences
seq_sub <- sequences[names(sequences) %in% seqs_select]

# Check the number of sequences left
length(sequences) # 4 original
length(seq_sub) # 3 after selection

# Run multiple sequence alignment using msa
seq_alignment <- msa(seq_sub, method="ClustalOmega") # Generates an error message because seq_sub is the wrong input

msa works if I import the fasta file straight away as a Biostrings DNAstringset object:

# Import fasta file directly as Biostrings object
seq_dnastring <- readDNAStringSet("fasta_file.fasta") 
seq_alignment <- msa(seq_dnastring, method="ClustalOmega")

I could save the processed fasta file I make using seqinr and then re-load it using readDNAStringSet, but my question is whether there is a way to convert directly from seq_sub to something that msa can use as input to run an alignment with. ie. to turn seq_sub format into seq_dnastring format. Thanks for the help.


Solution

  • Another option is to process your sequences in Biostrings, instead of seqinr. Nonetheless, I think this does the trick

    library(Biostrings)
    library(seqinr)
    FUN = function(x)
        paste(getSequence(x), collapse = "")
    as(vapply(sequences, FUN, character(1)), "DNAStringSet")