I am trying to randomly remove 10%, 15% and 20% of the nucleotides in a fasta file.
So let's say I have a fasta file like this...
>GCA_900186885_1_000000000001
ATGCAAACATTTGTAAAAAACTTAATCGAT
I want to randomly choose and delete 10% of the nucleotides, which in this case would be 3, resulting in a fasta file with the same header, but with 3 fewer nucleotides:
>GCA_900186885_1_000000000001
ATGAAACATTGTAAAAACTTAATCGAT
The above is a simple example, which could easily be done manually, but I have a large fasta file with 2132142 nucleotides and thus want to generate three new fasta files using the original, but with 1918928, 1812321, and 1705714 nucleotides, representing a 10%, 15% and 20% reduction.
I have searched forums like stackoverflow and biostars for some related questions, but have not found anything useful.
I tried the following adaptation of a suggestion from another user to randomly delete lines from a file, but it didn't work.
filename=/Users/home/DETECTION/GCA_900186885.1_48903_D01_genomic_reformatted.fa
number=1918928
NT_count="$(grep -v ">" $filename | grep -E -o "G|C|T|A|N" | wc -l)"
NT_nums_to_delete="$(shuf -i "1-$NT_count" -n "$number")"
sed_script="$(printf '%dd;' $NT_nums_to_delete)"
sed -i.bak -e "$sed_script" "$filename"
The sed_script="$(printf '%dd;' $NT_nums_to_delete)" command resulted in the following error that I could not figure out.
zsh: bad math expression: operator expected at `2122589\n98...'
Any insight would be much appreciated. Thank you!
As you tagged bash, here's one in awk:
$ awk -v s=$RANDOM 'BEGIN {
if(s) # if seed provided as var s
srand(s) # use s as seed
else # or not
srand()
}
!/^>/ { # process non > starting records
i=int(0.1*(length($0))) # 10 % of length to remove
for(j=1;j<=i;j++) {
p=int(rand()*length($0))+1 # p the point from where to remove
$0=substr($0,1,(p-1)) substr($0,(p+1)) # ie. skip p:s char
}
}
1' file # output
Sample output:
>GCA_900186885_1_000000000001
ATGCAAACATGTAAAAAACTTAACGAT