I generated Bowtie index files using bowtie-build in a bash script as follows:
bowtie-build $FA_FILE $OUTPUT_BASE
(script can be found here: https://github.com/kennethphough/bioinformatics/blob/master/sge/sge_build_index)
I want each node of my cluster to align my sequence files to a single chromosome rather than to the entire genome. In theory, running one instance of tophat per chromosome for the same sequence file, one per node, should be faster than running tophat on one node against the entire genome.
I made sure that the location of my bowtie index files was exported like so:
export BOWTIE_INDEXES="$(dirname ${EBWT})/"
and then execute tophat like so:
tophat -p 4 -G $GTF -o $OBASE $Chr $FASTQ
$GTF contains the path to the annotation file, $Chr contains the file name of the index file (excluding the file extension .ebwt), and $FASTQ contains the path to my sequence read file.
(script can be found here: https://github.com/kennethphough/bioinformatics/blob/master/sge/sge_tophat)
When I run the script, I get an error saying the Bowtie index could not be found. Excerpt below:
[Sun Oct 5 15:08:48 2014] Beginning TopHat run (v1.1.2)
-----------------------------------------------
[Sun Oct 5 15:08:48 2014] Preparing output location /home/kennethphough/GSE58365/fast/chr11_gl000202_random.1/
[Sun Oct 5 15:08:48 2014] Checking for Bowtie index files
Error: Could not find Bowtie index files /home/kennethphough/genome/hg19/chr11_gl000202_random.1.*
The Bowtie index file in question for the above error is chr11_gl000202_random.1.ebwt, and I have confirmed that it is there. Any lead on what's going wrong would be greatly appreciated.
Bowtie version is 0.12.7; TopHat version is 1.1.2.
The issue was that a Bowtie index consists of more than one file. For the example above, chr11_gl000202_random has:
chr11_gl000202_random.1.ebwt
chr11_gl000202_random.2.ebwt
chr11_gl000202_random.3.ebwt
chr11_gl000202_random.rev.1.ebwt
chr11_gl000202_random.rev.2.ebwt
So instead of passing the file name without the extension (which still ends in .1), I needed to pass the index base name, i.e. the chromosome sequence name, which I extract like so:
Chr=`echo "$FNAME" | awk -F. '{print $1}'`
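The awk call above keeps everything before the first dot, which works as long as the sequence name itself contains no dot. As a rough alternative, here is a sketch that strips only the known index suffixes using shell parameter expansion. The helper name index_base is made up for illustration and is not part of Bowtie or TopHat:

```shell
# Hypothetical helper: reduce a Bowtie 1 index file name to its base name.
# The body runs in a subshell, so "name" does not leak into the caller.
index_base() (
    name=$(basename "$1")   # drop any leading directory
    name=${name%.ebwt}      # drop the .ebwt extension
    name=${name%.[0-9]}     # drop the trailing index number (.1, .2, ...)
    name=${name%.rev}       # drop the ".rev" marker of the reverse index files
    printf '%s\n' "$name"
)

index_base chr11_gl000202_random.1.ebwt       # prints chr11_gl000202_random
index_base chr11_gl000202_random.rev.2.ebwt   # prints chr11_gl000202_random
```

In the script this would be used as Chr=$(index_base "$FNAME"), assuming $FNAME holds the name of one of the .ebwt files.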
I've updated my script on GitHub to reflect the changes.
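To catch this kind of mistake before a job is submitted, one option is to check that the index name actually resolves to index files. This is only a sketch: has_index is a hypothetical helper, and the glob it uses is my guess at a lookup similar to the one in the error message, not TopHat's actual code.

```shell
# Hypothetical pre-flight check: succeed if at least one file named
# <dir>/<base>.*.ebwt exists.
# $1: directory holding the index files, $2: index base name
has_index() (
    for f in "$1"/"$2".*.ebwt; do
        # If the glob matched nothing, $f is the unexpanded pattern
        # and the -e test fails.
        [ -e "$f" ] && exit 0
    done
    exit 1
)
```

For example, has_index "$BOWTIE_INDEXES" "$Chr" || echo "no index for $Chr" >&2 would flag chr11_gl000202_random.1 as missing while accepting chr11_gl000202_random.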