Excluding specific substrings from DNA sequences and shuffling the result (i.e. generating a negative set from positive DNA sequences)


I have a FASTA file containing DNA sequences, and I want to generate a negative dataset from this positive data. One approach is to remove some specific subsequences from the data and then shuffle what remains.
Let's say my dataset is a list:

1)
DNAlst:
ACTATACGCTAATATCGATCTACGTACGATCG
CAGCAGCAGCGAGACTATCCTACCGCA
ATATCGATCGCAAAAATCG

I want to exclude these sequences:

ATAT,CGCA

so the result would be:

ACTATACGCTACGATCTACGTACGATCG
CAGCAGCAGCGAGACTATCCTAC
CGATAAAATCG

2) Then I want to shuffle each sequence in blocks of a given length (e.g. 5). That is, split the DNA string into 5-mers and shuffle those blocks. For example:

ATATACGCGAAAAAATCTCTC => result after shuffle by 5 ==> AAAAACTCTCCGCAATATA

I would be grateful if you could tell me how to do this in R.


Solution

  • Use the stringi package:

    library(stringi)

    dna <- c("ACTATACGCTAATATCGATCTACGTACGATCG","CAGCAGCAGCGAGACTATCCTACCGCA","ATATCGATCGCAAAAATCG")
    
    # stri_replace_all_regex replaces every occurrence of ATAT or CGCA with an empty string
    stri_replace_all_regex(dna, "ATAT|CGCA", "")
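
    One caveat worth checking: a single pass can leave a motif behind, because deleting one match may join the flanking bases into a new occurrence. If such leftovers matter for the negative set, a minimal sketch is to repeat the replacement until nothing changes. The helper name remove_motifs and the test string CGCGCACA are made up for illustration, and the motifs are assumed to be plain A/C/G/T strings with no regex metacharacters.

    ```r
    library(stringi)

    # Hypothetical helper: strip every motif, repeating until the strings stop changing
    remove_motifs <- function(x, motifs) {
      pattern <- stri_flatten(motifs, collapse = "|")  # e.g. "ATAT|CGCA"
      repeat {
        cleaned <- stri_replace_all_regex(x, pattern, "")
        if (identical(cleaned, x)) break  # no further matches were removed
        x <- cleaned
      }
      x
    }

    # A single pass removes the inner CGCA, and the remaining bases form CGCA again
    stri_replace_all_regex("CGCGCACA", "ATAT|CGCA", "")
    ## [1] "CGCA"

    # Repeated passes remove that leftover as well
    remove_motifs("CGCGCACA", c("ATAT", "CGCA"))
    ## [1] ""
    ```

    For the three example sequences a single pass is already enough, so both approaches should give the same result there.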
    

    Now the shuffle part. The seq and stri_sub functions are useful here. First we need to cut the DNA sequence into pieces at most 5 characters long. seq gives us the starting positions:

    seq(1,24,5)
    ## [1]  1  6 11 16 21
    seq(1,27,5)
    ## [1]  1  6 11 16 21 26 
    

    stri_sub then extracts a substring of length 5 starting at each index generated by seq (the last piece may be shorter):

    y <- stri_sub(dna[1], seq(from=1,to=stri_length(dna[1]),by=5), length = 5)
    y
    ## [1] "ACTAT" "ACGCT" "AATAT" "CGATC" "TACGT" "ACGAT" "CG"   
    

    sample shuffles the vector, and stri_flatten pastes the pieces back together into one string. Because sample is random, the result varies from run to run:

    stri_flatten(y[sample(length(y))])
    ## [1] "TACGTACGATCGATCAATATACGCTACTATCG"
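
Note that the snippets above shuffle only dna[1], and they shuffle the original string, because the result of stri_replace_all_regex was never stored. To apply both steps to every sequence, something along these lines might work. This is a rough sketch rather than a tested pipeline: shuffle_kmers and make_negative are hypothetical helper names, the motif removal is a single pass, and set.seed is added only so that a run can be repeated.

```r
library(stringi)

# Hypothetical helper: cut one string into chunks of at most k characters,
# permute the chunks, and paste them back together
shuffle_kmers <- function(s, k = 5) {
  n <- stri_length(s)
  # Nothing to shuffle for NA or short strings; this also avoids
  # the error from seq(1, 0, by = k) on an empty string
  if (is.na(n) || n <= k) return(s)
  chunks <- stri_sub(s, seq(from = 1, to = n, by = k), length = k)
  stri_flatten(chunks[sample(length(chunks))])
}

# Hypothetical wrapper: remove the motifs (single pass), then shuffle each sequence
make_negative <- function(x, motifs, k = 5) {
  cleaned <- stri_replace_all_regex(x, stri_flatten(motifs, collapse = "|"), "")
  vapply(cleaned, shuffle_kmers, character(1), k = k, USE.NAMES = FALSE)
}

dna <- c("ACTATACGCTAATATCGATCTACGTACGATCG",
         "CAGCAGCAGCGAGACTATCCTACCGCA",
         "ATATCGATCGCAAAAATCG")

set.seed(1)  # only for repeatability
neg <- make_negative(dna, c("ATAT", "CGCA"))
neg
```

Each shuffled string keeps the length and base composition of its cleaned sequence; only the order of the 5-mers changes. The short remainder chunk can land anywhere, so it will not necessarily stay at the end.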