I have a FASTA file with about 8,000 sequences in it. I need to change each identifier line to a random, unique, shortened name (max length 10). The FASTA file contains sequences like this:
>AX039539.1.1212 Bacteria;Chloroflexi;Dehalococcoidia;Dehalococcoidales;
GAUGAACGCUAGCGGCGUGCCUUAUGCAUGCAAGUCGAACGGUCUUAAGCAAUUAAGAUAGUGGCAAACGGGUGAGUAACGCGUAAGUAACCUACCUCUAAGUGGGGGAUAGCUUCGGGAAACUGAAGGUAAUACCGCAUGUGGUGGGCCGACAUAAGUUGGUUCACUAAAGCCGUAAGGUGCUUGGUGAGGGGCUUGCGUCCGAUUAGCUAGUUGGUGGGGUAACGGCCUACCAAGGCUUCGAUCGGUAGCUGGUCUGAGAGGAUGAUCAGCCACACUGGGACUGAGACACGGCCCAGACUCCUACGGGAG
Here is my script so far:
use strict;
use warnings;
#change ID line name to random unique shorten (max 10 characters) string
open (my $fh,"$ARGV[0]") or die "Failed to open file: $!\n";
open (my $out_fh, ">$ARGV[0]_shorten_ID.fasta");
my $string;
while(<$fh>) {
for (0..9) { $string .= chr( int(srand(rand(25) + 65) )); }
if ($_ =~ s/^>*.+\n/>$string/){ # change header FASTA header
print $out_fh "$_";
}
}
close $fh;
close $out_fh;
I have been trying this, but the ID starts with 10 characters and then grows by 10 more on every line, and I lose the sequences. I realize there are similar questions already, but mine is slightly different: I need to randomly generate unique shortened names.
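Your problem can simply be fixed by resetting $string to an empty string just inside the while loop. You lose the sequences because the pattern ^>*.+\n also matches lines that do not start with > (and swallows the newline), so anchor it as ^>.* and print every line, not only the ones where the substitution succeeded. But the script is needlessly complex (and also inefficient: you generate and throw away random identifiers when you are not looking at a line starting with >); I would go with just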
perl -pe 'BEGIN { srand(time()); }
s/^>.*/ ">" . join ("", map { chr(rand(26)+65) } 0..9) /e' file.fasta
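One caveat: the one-liner above picks each identifier independently, so nothing stops two headers from getting the same name (unlikely with ten letters and 8,000 records, but not impossible). If uniqueness has to be guaranteed, a rough sketch along these lines should work; the names random_id, rename_headers and %seen are my own, and the two records in the demo are made up:

```perl
use strict;
use warnings;

my %seen;    # IDs handed out so far

# Return a random 10-letter uppercase ID that has not been used before.
sub random_id {
    my $id;
    do {
        $id = join '', map { chr( int( rand(26) ) + 65 ) } 1 .. 10;
    } while ( $seen{$id}++ );    # retry if this ID was already taken
    return $id;
}

# Replace every FASTA header line; sequence lines pass through unchanged.
sub rename_headers {
    my @lines = @_;
    for my $line (@lines) {
        # "." does not match "\n", so the line ending is kept
        $line =~ s/^>.*/'>' . random_id()/e;
    }
    return @lines;
}

# Demo on two tiny in-memory records (made-up data).
my @fasta = (
    ">AX039539.1.1212 Bacteria;Chloroflexi;\n", "GAUGAACGCUAGCGG\n",
    ">second record\n",                         "ACGU\n",
);
print rename_headers(@fasta);
```

For a real file you could replace the demo with print rename_headers(<>); and run it as perl script.pl file.fasta > out.fasta (script.pl being whatever you call the file).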
If you do not absolutely require properly pseudorandom identifiers, maybe go with just
perl -pe 'BEGIN { $id = "a" x 7 } s/>.*/">" . $id++/e' file.fasta
which produces identifiers like "aaaaaaa", "aaaaaab", etc. (I went for seven-character identifiers, but four characters would be more than enough for 8,000 unique IDs; starting from "aaaa" you'd end at "alvr".)
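If you want to check the "alvr" figure, or just see how Perl's magic string increment behaves, something like this should do (a small sketch; @ids is only there for the demonstration):

```perl
use strict;
use warnings;

# Magic string increment: "aaaa" -> "aaab" -> ... -> "aaaz" -> "aaba" -> ...
my $id = "aaaa";
my @ids;
push @ids, $id++ for 1 .. 8000;    # post-increment yields the old value

print "first: $ids[0], last: $ids[-1]\n";    # first: aaaa, last: alvr
```

The increment only works this way on a variable that has been used purely as a string, which is why $id is initialised with a quoted value.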