r bioinformatics fasta bioconductor

Randomly subsample FASTA sequences and change sequence names


I have a FASTA file (fas2) containing about 1000 sequences. Here are a couple of example sequences:

>gi|108863165-BAdV-2
ATGGCTACTCCTTCGATGATGCCGCAGTGGTCTTACATGCACATCGCCGGGCAGGATGCCTCCGA
>gi|108863163-BAdV-1
ATGGCGACGCCGTCGATGATGCCCCAGTGGTCGTACATGCACATCGCCGGGCAGGATGCCTCAGA

I looked online and saw that many tutorials use readDNAStringSet to read FASTA files, so I used this command to read my file:

fas3 <- readDNAStringSet(fas2, "fasta")

It creates a data.frame-like structure (though it is not a data.frame) for viewing the FASTA file. Is there a function in R that can randomly sample 500 sequences out of fas3? Also, how do I rename a specific sequence, for example from gi|108863165-BAdV-2 to BAdV-2? Thanks in advance!


Solution

  • fas3 follows a 'vector-like' interface rather than a data.frame one, so you can sample sequences by drawing 500 random indices between 1 and length(fas3), without replacement, and using them to subset

    fas3.subset = fas3[sample(length(fas3), 500)]
    

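    Use the replacement function names()<- to update names. The pattern contains "|", which means alternation in a regular expression, so match it literally with fixed = TRUE, e.g.,

     names(fas3) = sub("gi|108863165-", "", names(fas3), fixed = TRUE)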
    

    This is illustrated on the help page ?DNAStringSet. See also the Bioconductor support site for a more appropriate forum for questions about Bioconductor packages.
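
The renaming step is easy to get subtly wrong because of the "|" in the headers, so here is a minimal base-R sketch of it on a plain character vector standing in for names(fas3). The first two headers are taken from the question; the other four, the seed, and the variable names (headers, short, idx, and so on) are assumptions made for illustration only.

```r
# Toy stand-in for names(fas3). The first two headers come from the question;
# the other four are made up purely for illustration.
headers <- c("gi|108863165-BAdV-2", "gi|108863163-BAdV-1",
             paste0("gi|10886316", 6:9, "-BAdV-", 3:6))

# An unescaped "|" means "or" in a regular expression, so this pattern matches
# "gi" OR "108863165-", and sub() removes only the first match it finds.
naive <- sub("gi|108863165-", "", headers[1])
print(naive)    # "|108863165-BAdV-2"

# fixed = TRUE treats the pattern as a literal string.
literal <- sub("gi|108863165-", "", headers[1], fixed = TRUE)
print(literal)  # "BAdV-2"

# One escaped pattern handles every header: "gi|", a run of digits, then "-".
short <- sub("^gi\\|[0-9]+-", "", headers)
print(short)

# Draw indices without replacement; set.seed() makes the draw repeatable.
set.seed(1)       # arbitrary seed
n_keep <- 3       # would be 500 for the real data
idx <- sample(length(headers), n_keep)
subset_names <- short[idx]
print(subset_names)
```

On the real object the same two steps would presumably be fas3.subset <- fas3[sample(length(fas3), 500)] followed by names(fas3.subset) <- sub("^gi\\|[0-9]+-", "", names(fas3.subset)). If the subset needs to be saved, Biostrings provides writeXStringSet(), e.g. writeXStringSet(fas3.subset, "subset.fasta"), where the output file name is hypothetical.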