Search code examples
pythonrbashfasta

How can I randomly delete a set number of characters from a file?


I am trying to randomly remove 10%, 15% and 20% of the nucleotides in a fasta file.

So let's say I have a fasta file like this...

>GCA_900186885_1_000000000001
ATGCAAACATTTGTAAAAAACTTAATCGAT

I want to randomly choose and delete 10% of the nucleotides, which in this case would be 3, resulting in a fasta file with the same header, but with 3 fewer nucleotides:

>GCA_900186885_1_000000000001
ATGAAACATTGTAAAAACTTAATCGAT

The above is a simple example, which could easily be done manually, but I have a large fasta file with 2132142 nucleotides and thus want to generate three new fasta files using the original, but with 1918928, 1812321, and 1705714 nucleotides, representing a 10%, 15% and 20% reduction.

I have searched forums like stackoverflow and biostars for some related questions, but have not found anything useful.

I tried the following adaptation of a suggestion from another user to randomly delete lines from a file, but it didn't work.

filename=/Users/home/DETECTION/GCA_900186885.1_48903_D01_genomic_reformatted.fa
number=1918928

NT_count="$(grep -v ">" $filename | grep -E -o "G|C|T|A|N" | wc -l)"
NT_nums_to_delete="$(shuf -i "1-$NT_count" -n "$number")"
sed_script="$(printf '%dd;' $NT_nums_to_delete)"
sed -i.bak -e "$sed_script" "$filename"

The sed_script="$(printf '%dd;' $NT_nums_to_delete)" command resulted in the following error that I could not figure out.

zsh: bad math expression: operator expected at `2122589\n98...'

Any insight would be much appreciated. Thank you!


Solution

  • As you tagged bash, here's one in awk:

    $ awk -v s=$RANDOM 'BEGIN {
        if(s)                                      # if seed provided as var s
            srand(s)                               # use s as seed 
        else                                       # or not
            srand()
    }
    !/^>/ {                                        # process non > starting records
        i=int(0.1*(length($0)))                    # 10 % of length to remove
        for(j=1;j<=i;j++) {
            p=int(rand()*length($0))+1             # p the point from where to remove
            $0=substr($0,1,(p-1)) substr($0,(p+1)) # ie. skip p:s char 
        }
    }
    1' file                                        # output
    

    Sample output:

    >GCA_900186885_1_000000000001
    ATGCAAACATGTAAAAAACTTAACGAT