Search code examples
rbioconductordna-sequence

Convert DNAStringSet to a list of elements in R? (Error in seq[[1]][["seq"]] : subscript out of bounds in R)


I have a bed file which contains DNA sequences information as follow:

**

track name="194" description="194 methylation (sites)" color=0,60,120 useScore=1
chr1    15864   15866   FALSE   894 +
chr1    534241  534243  FALSE   921 -
chr1    710096  710098  FALSE   729 +
chr1    714176  714178  FALSE   12  -
chr1    720864  720866  FALSE   988 -

**

I loaded the bed file in R and named the matrix DataSet. I used the follow code to get the sequences:

mydataSet_Test1<-dataSet[,1:3]

library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19
chr<-as.matrix(as.character(mydataSet_Test1[,1]))

#50
start<-as.matrix(as.integer(as.character(mydataSet_Test1[,2]))-50)
end<-as.matrix(as.integer(as.character(mydataSet_Test1[,3]))+50)
Seqs50_Test1<-getSeq(genome,chr,start=start,end=end)

Now, Seqs50_Test1 is Large DNAStringSet. I want now to load the BioSeqClass package in R, to do a homolog reduction in my sequences.

I want to use the hr() function, which, according to the package manual, is like this:

Description

Filter homolog sequences by sequence similarity. hr(seq, method, identity, cdhit.path)

Arguments

seq a list with one element for each protein/gene sequence. The elements are in two parts, one the description ("desc") and the second is a character string of the biological sequence ("seq").

identity a numeric value ranged from 0 to 1. It is used as a maximum identity cutoff among input sequences.

method a string for the method of homolog redunction. This must be one of the strings "cdhit" or "aligndis".

My question is how can I convert my DNAStringSet to the list of elements the function hr() wants? I tried using the list() function but when I run the hr() function, it gives me an error Error in seq[1][["seq"]] : subscript out of bounds

FULL CODE:

mydataSet = dataSet[,1:3]

library(BSgenome.Hsapiens.UCSC.hg19)
genome = BSgenome.Hsapiens.UCSC.hg19
chr = as.matrix(as.character(mydataSet[,1]))
start = as.matrix(as.integer(as.character(mydataSet[,2]))-200)
end = as.matrix(as.integer(as.character(mydataSet[,3]))+200)
Seqs = getSeq(genome,chr,start=start,end=end)
writeXStringSet(Seqs, "C:\\Users\\JL009\\Desktop\\Seqs.fasta", append=FALSE, format = "fasta")

#if (!requireNamespace("BiocManager", quietly = TRUE))
#  install.packages("BiocManager")

#BiocManager::install("BioSeqClass")

library(BioSeqClass)
library(Biostrings)
seq = as.character(readAAStringSet("C:\\Users\\JL009\\Desktop\\Seqs.fasta"))
reducSeqs = hr(seq, method="aligndis", identity=0.4)

Solution

  • Try something like this then:

    library(BioSeqClass)
    library(Biostrings)
    library(BSgenome.Hsapiens.UCSC.hg19)
    
    gr = GRanges(seqnames="chr1",IRanges(start=seq(10e6,11e6,length.out=10),width=50))
    S = getSeq(BSgenome.Hsapiens.UCSC.hg19,gr)
    names(S) = paste("seq",1:length(S))
    
    input = lapply(seq_along(S),function(i)list(desc=names(S)[i],seq=as.character(S[[i]])))
    
    hr(input,method="aligndis",identity=0.5)