bash, bioinformatics, sungridengine

TopHat could not find bowtie index files even though path is correct


I generated bowtie index files using bowtie-build in a bash script as follows:

bowtie-build $FA_FILE $OUTPUT_BASE

(script can be found here: https://github.com/kennethphough/bioinformatics/blob/master/sge/sge_build_index)

I want each node of my cluster to align my sequence files to a single chromosome rather than to the entire genome. In theory, running one instance of tophat per chromosome on the same sequence file, each on its own node, should be faster than running tophat on one node against the entire genome.

I made sure that the location of my bowtie index files was exported like so:

export BOWTIE_INDEXES="$(dirname ${EBWT})/"

and then execute tophat like so:

tophat -p 4 -G $GTF -o $OBASE $Chr $FASTQ

$GTF contains the path to the annotation file, $Chr contains the file name of the index file (excluding the file extension .ebwt), and $FASTQ contains the path to my sequence read file.

(script can be found here: https://github.com/kennethphough/bioinformatics/blob/master/sge/sge_tophat)

When I run the script I get an error saying the bowtie index files could not be found. Excerpt below:

[Sun Oct  5 15:08:48 2014] Beginning TopHat run (v1.1.2)
-----------------------------------------------
[Sun Oct  5 15:08:48 2014] Preparing output location /home/kennethphough/GSE58365/fast/chr11_gl000202_random.1/
[Sun Oct  5 15:08:48 2014] Checking for Bowtie index files
Error: Could not find Bowtie index files /home/kennethphough/genome/hg19/chr11_gl000202_random.1.*

The bowtie index file in question for the above error is chr11_gl000202_random.1.ebwt, and I have confirmed that it exists. Any lead on what's going wrong would be greatly appreciated.

Bowtie version is 0.12.7; TopHat version is 1.1.2.
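
For illustration, here is a minimal sketch of how the variables described above fit together. It assumes that $EBWT holds the full path to one index file (the question only shows `dirname ${EBWT}`, so the exact value is an assumption) and that $Chr is that file name with only .ebwt removed. Under those assumptions the resulting name is consistent with the pattern in the error message:

```bash
#!/usr/bin/env bash
# Assumption: EBWT is the full path to a single Bowtie index file.
EBWT=/home/kennethphough/genome/hg19/chr11_gl000202_random.1.ebwt

# Directory holding the index files, with a trailing slash as in the question.
BOWTIE_INDEXES="$(dirname "$EBWT")/"
export BOWTIE_INDEXES

# File name with only the .ebwt extension removed, as the question describes.
Chr=$(basename "$EBWT" .ebwt)

echo "BOWTIE_INDEXES=$BOWTIE_INDEXES"
echo "Chr=$Chr"
# Directory + name + ".*" gives the same string as the path in the error message.
echo "pattern=${BOWTIE_INDEXES}${Chr}.*"
```

Note that $Chr still ends in `.1` here, which is the part of the name worth looking at when the file itself clearly exists.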


Solution

  • The issue was that a bowtie index consists of more than one file. In the example above, the index chr11_gl000202_random has:

    chr11_gl000202_random.1.ebwt
    chr11_gl000202_random.2.ebwt
    chr11_gl000202_random.3.ebwt
    chr11_gl000202_random.rev.1.ebwt
    chr11_gl000202_random.rev.2.ebwt
    

    so instead of passing the file name without the .ebwt extension (chr11_gl000202_random.1), I needed to pass the index base name, i.e. the chromosome sequence name (chr11_gl000202_random), which I get like so:

    Chr=`echo "$FNAME" | awk -F. '{print $1}'`
    

    I've updated my script on GitHub to reflect the changes.
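
As a possible variation on the awk one-liner above, the base name can also be derived with shell parameter expansion, which keeps working if the base name itself contains a dot (awk with `-F.` returns only the text before the first dot). The sketch below is not taken from the updated script; `index_base` and `check_index` are hypothetical helper names, and the check covers only the .1, .2, .3, .rev.1 and .rev.2 files. It runs on empty placeholder files in a temporary directory, so it needs neither Bowtie nor TopHat installed:

```bash
#!/usr/bin/env bash
# Hypothetical helpers; not part of Bowtie or TopHat.

# Print the index base name for any one Bowtie 1 index file.
index_base() {
    local name
    name=$(basename "$1")
    name=${name%.ebwt}     # drop the extension
    name=${name%.[0-9]}    # drop the trailing file number (.1, .2, ...)
    name=${name%.rev}      # drop the .rev marker, if present
    printf '%s\n' "$name"
}

# Return 0 only if the expected index files exist for the given base name.
check_index() {
    local dir=$1 base=$2 suffix
    for suffix in 1 2 3 rev.1 rev.2; do
        if [ ! -f "$dir/$base.$suffix.ebwt" ]; then
            echo "missing: $dir/$base.$suffix.ebwt" >&2
            return 1
        fi
    done
}

# Demo on empty placeholder files in a temporary directory.
demo_dir=$(mktemp -d)
for suffix in 1 2 3 rev.1 rev.2; do
    : > "$demo_dir/chr11_gl000202_random.$suffix.ebwt"
done

Chr=$(index_base "$demo_dir/chr11_gl000202_random.1.ebwt")
echo "$Chr"
check_index "$demo_dir" "$Chr" && echo "index complete"

rm -rf "$demo_dir"
```

Running a check like `check_index` before calling tophat gives a clearer message about which file is missing than waiting for the aligner to fail.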