I want to BLASTn several sequences against a reference genome using NCBI BLAST+ and output only the line with the accession number, E-value, and other information on it from the BLAST+ output (because there are several extraneous lines from the BLAST+ output) to a csv. I have these files:
Text file with accession numbers for human gene sequences, one per line: GSEA-toBLASTaccession.txt
Reference genome: botznik-chr.fa
Output csv: GSEABLAST.csv
Here is the code I have written to execute this:
for acc in `cat GSEA-toBLASTaccession.txt`; do
echo $acc | blastn -db botznik-chr.fa -out GSEABLAST.out -num_alignments 1 \
-outfmt "6 qacc evalue qstart qend sstart send bitscore score length pident \
nident ppos positive mismatch gapopen" >> GSEABLAST.csv
done
I am not getting the results I need from this; what do I need to tweak to get a CSV with the accession number, E-value, query start/end, sequence start/end, bitscore, score, length, identity %/number, positive %/number, mismatch, and gap/open on it for a BLAST for each of the genes in my list of accession numbers?
You don't describe the output that you do get so it requires some guesswork to figure out what the problem is. I suspect that you're expecting the output for all sequences to be in GSEABLAST.csv
but instead you're getting the output from only the last sequence in GSEABLAST.out
If that is the problem, it's because -out GSEABLAST.out
causes the output to be sent to the specified file rather than to STDOUT. This file gets overwritten with every iteration of the for
loop. If you remove that part of the command, the output will go to STDOUT which is then appended to GSEABLAST.csv