I am currently writing a library that uses the -outfmt 10 option of BLAST, which gives you CSV instead of the pretty human-readable format. For example:
tblastn -db dmel_a -query somequery.faa -outfmt 10
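Since -outfmt 10 is plain CSV, a library can read it with a standard CSV parser. Here is a minimal Python sketch, assuming the default twelve columns (no custom format specifiers were passed). The sample hit line is made up for illustration, and `parse_outfmt10` is a hypothetical helper name, not part of BLAST.

```python
import csv
import io

# Default column order for -outfmt 10 when no custom specifiers are given.
DEFAULT_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

def parse_outfmt10(text, fields=DEFAULT_FIELDS):
    """Turn -outfmt 10 CSV text into a list of dicts keyed by column name."""
    reader = csv.reader(io.StringIO(text))
    return [dict(zip(fields, row)) for row in reader if row]

# Made-up hit line, for illustration only.
sample = "query1,some_seq_id,85.71,42,6,0,1,42,300,175,1e-15,72.4\n"
hits = parse_outfmt10(sample)
print(hits[0]["sseqid"], hits[0]["sstart"], hits[0]["send"])
```

All values come back as strings, so coordinates need an `int()` before any arithmetic.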
The problem is that I want to access the db source file so I can extract some sequences after processing. The only way I know to do this is to remove -outfmt 10 and run BLAST a second time. Then I parse the human-readable output for the line that says:
Database: Source.fas
But that only works if -title is not specified when creating the database with makeblastdb. The stitle field of -outfmt 10 seems to be the FASTA header line anyway. I cannot just look for the database name followed by .fna, .fas, or .faa, because you can name the database differently than the source file.
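For reference, the report-parsing workaround described above could be sketched like this in Python. It assumes the title fits on the single "Database:" line of the report. The report fragment is made up apart from that line, and `database_title` is a hypothetical helper name.

```python
import re

def database_title(report_text):
    """Return the text after 'Database:' in a pairwise BLAST report, or None."""
    match = re.search(r"^Database:\s*(.+)$", report_text, flags=re.MULTILINE)
    return match.group(1).strip() if match else None

# Made-up report fragment; only the 'Database:' line comes from the post above.
report = "TBLASTN (version line)\n\nDatabase: Source.fas\n   (sequence counts)\n"
print(database_title(report))
```

As noted, this only returns the source file name when the database was built without a custom title.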
Is there another way to get the FASTA source file from the BLAST database name? I do not see one in the list of outfmt options. Or am I blind today?
Found a solution that worked, based on a Biostar question and a blasted bioinformatics blog post. Requires BLAST+ 2.2.28 if your FASTA doesn't follow NCBI naming exactly.
When you create the blast database, use the -parse_seqids
flag. Then, with blastdbcmd, you can extract a range of a sequence by its ID:
blastdbcmd -db t/blastTest/dmel -range 1-10 -entry some_seq_id
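To tie this back to the CSV output, here is a rough Python sketch that turns one parsed hit into a blastdbcmd call. It is an illustration under assumptions, not a tested pipeline: the hit dict layout and the helper names are hypothetical, the database is assumed to have been built with -parse_seqids, and as far as I know reverse-strand hits are reported with sstart > send, so the coordinates are reordered and -strand is set accordingly.

```python
import subprocess

# Assumes the database was built beforehand with something like:
#   makeblastdb -in Source.fas -dbtype nucl -parse_seqids -out t/blastTest/dmel

def blastdbcmd_args(db, hit):
    """Build a blastdbcmd command line that fetches the subject region of one hit."""
    start, end = int(hit["sstart"]), int(hit["send"])
    # Reverse-strand hits have sstart > send, but -range wants the smaller value first.
    strand = "plus" if start <= end else "minus"
    low, high = min(start, end), max(start, end)
    return [
        "blastdbcmd", "-db", db,
        "-entry", hit["sseqid"],
        "-range", f"{low}-{high}",
        "-strand", strand,
    ]

def fetch_region(db, hit):
    """Run blastdbcmd and return its FASTA output (needs BLAST+ on PATH)."""
    result = subprocess.run(
        blastdbcmd_args(db, hit), check=True, capture_output=True, text=True
    )
    return result.stdout

# Made-up hit, shaped like one row of parsed -outfmt 10 output.
hit = {"sseqid": "some_seq_id", "sstart": "300", "send": "175"}
print(" ".join(blastdbcmd_args("t/blastTest/dmel", hit)))
```

Only `blastdbcmd_args` runs here; `fetch_region` is left uncalled because it needs a real database on disk.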