I have a bed file which contains DNA sequences information as follow:
**
track name="194" description="194 methylation (sites)" color=0,60,120 useScore=1
chr1 15864 15866 FALSE 894 +
chr1 534241 534243 FALSE 921 -
chr1 710096 710098 FALSE 729 +
chr1 714176 714178 FALSE 12 -
chr1 720864 720866 FALSE 988 -
**
I loaded the bed file in R and named the matrix DataSet. I used the follow code to get the sequences:
mydataSet_Test1<-dataSet[,1:3]
library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19
chr<-as.matrix(as.character(mydataSet_Test1[,1]))
#50
start<-as.matrix(as.integer(as.character(mydataSet_Test1[,2]))-50)
end<-as.matrix(as.integer(as.character(mydataSet_Test1[,3]))+50)
Seqs50_Test1<-getSeq(genome,chr,start=start,end=end)
Now, Seqs50_Test1 is Large DNAStringSet. I want now to load the BioSeqClass package in R, to do a homolog reduction in my sequences.
I want to use the hr() function, which, according to the package manual, is like this:
Description
Filter homolog sequences by sequence similarity. hr(seq, method, identity, cdhit.path)
Arguments
seq a list with one element for each protein/gene sequence. The elements are in two parts, one the description ("desc") and the second is a character string of the biological sequence ("seq").
identity a numeric value ranged from 0 to 1. It is used as a maximum identity cutoff among input sequences.
method a string for the method of homolog redunction. This must be one of the strings "cdhit" or "aligndis".
My question is how can I convert my DNAStringSet to the list of elements the function hr() wants? I tried using the list() function but when I run the hr() function, it gives me an error Error in seq[1][["seq"]] : subscript out of bounds
FULL CODE:
mydataSet = dataSet[,1:3]
library(BSgenome.Hsapiens.UCSC.hg19)
genome = BSgenome.Hsapiens.UCSC.hg19
chr = as.matrix(as.character(mydataSet[,1]))
start = as.matrix(as.integer(as.character(mydataSet[,2]))-200)
end = as.matrix(as.integer(as.character(mydataSet[,3]))+200)
Seqs = getSeq(genome,chr,start=start,end=end)
writeXStringSet(Seqs, "C:\\Users\\JL009\\Desktop\\Seqs.fasta", append=FALSE, format = "fasta")
#if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
#BiocManager::install("BioSeqClass")
library(BioSeqClass)
library(Biostrings)
seq = as.character(readAAStringSet("C:\\Users\\JL009\\Desktop\\Seqs.fasta"))
reducSeqs = hr(seq, method="aligndis", identity=0.4)
Try something like this then:
library(BioSeqClass)
library(Biostrings)
library(BSgenome.Hsapiens.UCSC.hg19)
gr = GRanges(seqnames="chr1",IRanges(start=seq(10e6,11e6,length.out=10),width=50))
S = getSeq(BSgenome.Hsapiens.UCSC.hg19,gr)
names(S) = paste("seq",1:length(S))
input = lapply(seq_along(S),function(i)list(desc=names(S)[i],seq=as.character(S[[i]])))
hr(input,method="aligndis",identity=0.5)