Search code examples
rfastagenetics

Is there an R function that can find (like count) and give position of string?


I am looking for code in Rstudio that will search for Recombination signal sequences (RSS) within a DNA sequence and give me the position. I used count function however it does not give me position and I cannot input multiple RSSs into it at the same time which makes it very tedious. For instance:

cd247<-read.fasta("sequence.fasta")
#store DNA sequence for cd247 into cd247seq
cd247seq<-cd247[[1]]
cd247seq[1:50]
#change cd247seq from vector of characters to string
c2s(cd247seq)
#create table of all the 7 character sequences in cd247 gene
count(cd247seq,7)
cd247table<-count(cd247seq,7)
cd247table[["tactgtg"]]

outputs

cd247table[["tactgtg"]]

[1] 1

but not position within cd247seq

I have posted my files to github https://github.com/opheelorraine/Map-RSS


Solution

  • Have a look at the Biostrings package, in particular the matchPDict function. See https://kasperdanielhansen.github.io/genbioconductor/html/Biostrings_Matching.html

    Example:

    suppressPackageStartupMessages(library(Biostrings))
    dnaseq <- DNAString("ATAGCCATGATGATTTAACCAGGTCATTT") # the sequence
    motif <- DNAStringSet(c("ATG", "TGA")) # the motifs
    pos <- matchPDict(motif, dnaseq)
    pos
    #> MIndex object of length 2
    #> [[1]]
    #> IRanges object with 2 ranges and 0 metadata columns:
    #>           start       end     width
    #>       <integer> <integer> <integer>
    #>   [1]         7         9         3
    #>   [2]        10        12         3
    #> 
    #> [[2]]
    #> IRanges object with 2 ranges and 0 metadata columns:
    #>           start       end     width
    #>       <integer> <integer> <integer>
    #>   [1]         8        10         3
    #>   [2]        11        13         3
    start(pos) # ATG starts at pos 7 and 10; TGA at 8 and 11 
    #> IntegerList of length 2
    #> [[1]] 7 10
    #> [[2]] 8 11
    countPDict(motif, dnaseq)
    #> [1] 2 2
    
    # search reverse complement:
    matchPDict(reverseComplement(motif), dnaseq)
    #> MIndex object of length 2
    #> [[1]]
    #> IRanges object with 2 ranges and 0 metadata columns:
    #>           start       end     width
    #>       <integer> <integer> <integer>
    #>   [1]         6         8         3
    #>   [2]        25        27         3
    #> 
    #> [[2]]
    #> IRanges object with 1 range and 0 metadata columns:
    #>           start       end     width
    #>       <integer> <integer> <integer>
    #>   [1]        24        26         3
    

    Created on 2020-08-05 by the reprex package (v0.3.0)