I have a FASTA file containing DNA sequences. I want to generate a negative dataset from the positive data. One way is to remove some specific subsequences from my data and then shuffle what remains.
Let's say my dataset is a list:
1)
DNAlst:
ACTATACGCTAATATCGATCTACGTACGATCG
CAGCAGCAGCGAGACTATCCTACCGCA
ATATCGATCGCAAAAATCG
I want to exclude these sequences:
ATAT,CGCA
so the result would be:
ACTATACGCTACGATCTACGTACGATCG
CAGCAGCAGCGAGACTATCCTAC
CGATAAAAATCG
2)
Then I want to shuffle each sequence in blocks of a fixed length (e.g. 5). That is, the DNA string is split into consecutive pieces (5-mers) of length 5, and those pieces are shuffled. For example:
ATATACGCGAAAAAATCTCTC => result after shuffle by 5 ==> AAAAACTCTCCGCAATATA
I would be grateful if you could tell me how to do this in R.
Use the stringi package:
library(stringi)
dna <- c("ACTATACGCTAATATCGATCTACGTACGATCG","CAGCAGCAGCGAGACTATCCTACCGCA","ATATCGATCGCAAAAATCG")
# stri_replace_all_regex replaces every match of ATAT or CGCA with an empty string
stri_replace_all_regex(dna, "ATAT|CGCA","")
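One thing to keep in mind: a single replacement pass can leave a motif behind when deleting one match joins the text on either side into a new match. If the goal is that no excluded motif survives at all (an assumption about the intent, the question does not say), a minimal sketch is to repeat the replacement until nothing changes. The helper name remove_motifs is hypothetical, and the motifs are assumed to be plain DNA letters with no regex metacharacters.

```r
library(stringi)

# Hypothetical helper: strip the motifs repeatedly until none remain.
# Assumes motifs contain only plain letters (no regex metacharacters).
remove_motifs <- function(x, motifs) {
  pattern <- paste(motifs, collapse = "|")   # e.g. "ATAT|CGCA"
  repeat {
    cleaned <- stri_replace_all_regex(x, pattern, "")
    if (identical(cleaned, x)) break         # nothing left to remove
    x <- cleaned
  }
  x
}

dna <- c("ACTATACGCTAATATCGATCTACGTACGATCG",
         "CAGCAGCAGCGAGACTATCCTACCGCA",
         "ATATCGATCGCAAAAATCG")

remove_motifs(dna, c("ATAT", "CGCA"))
## [1] "ACTATACGCTACGATCTACGTACGATCG" "CAGCAGCAGCGAGACTATCCTAC"
## [3] "CGATAAAAATCG"

# Removing CGCA here leaves "ATAT", which only a second pass removes
remove_motifs("ATCGCAAT", c("ATAT", "CGCA"))
## [1] ""
```

For the three example sequences a single pass already gives the same result, so the loop only matters for inputs like the last one.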
Now the shuffle part. The seq and stri_sub functions will be useful. First we need to cut the DNA sequence into pieces of at most 5 characters. seq gives us the starting positions:
seq(1,24,5)
## [1] 1 6 11 16 21
seq(1,27,5)
## [1] 1 6 11 16 21 26
stri_sub then extracts substrings of length 5 starting at the positions generated by seq (the last piece may be shorter):
y <- stri_sub(dna[1], seq(from=1,to=stri_length(dna[1]),by=5), length = 5)
y
## [1] "ACTAT" "ACGCT" "AATAT" "CGATC" "TACGT" "ACGAT" "CG"
sample shuffles our vector of pieces and stri_flatten pastes them back together into one string. Because the shuffle is random, the output below is just one possible result:
stri_flatten(y[sample(length(y))])
## [1] "TACGTACGATCGATCAATATACGCTACTATCG"
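The steps above handle one sequence at a time. As a rough sketch, they can be wrapped in a function that is applied to every sequence. The name shuffle_by_k and its k argument are made up for illustration. Empty and NA strings are returned unchanged, which is a choice made here rather than anything the question specifies.

```r
library(stringi)

# Hypothetical helper: split each string into consecutive pieces of
# length k (the last piece may be shorter) and shuffle the pieces.
shuffle_by_k <- function(x, k = 5) {
  vapply(x, function(s) {
    n <- stri_length(s)
    if (is.na(n) || n == 0) return(s)        # nothing to shuffle
    starts <- seq(from = 1, to = n, by = k)  # 1, 1 + k, 1 + 2k, ...
    pieces <- stri_sub(s, from = starts, length = k)
    stri_flatten(pieces[sample(length(pieces))])
  }, character(1), USE.NAMES = FALSE)
}

dna <- c("ACTATACGCTAATATCGATCTACGTACGATCG",
         "CAGCAGCAGCGAGACTATCCTACCGCA",
         "ATATCGATCGCAAAAATCG")

# Step 1: drop the excluded motifs; step 2: shuffle in blocks of 5
cleaned <- stri_replace_all_regex(dna, "ATAT|CGCA", "")
set.seed(1)  # only to make the random shuffle reproducible
shuffle_by_k(cleaned, k = 5)
```

Note that the shorter leftover piece is shuffled along with the others, so it can end up anywhere in the result, not only at the end.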