Search code examples
pythonfor-loopnested-loopsfasta

Workaround for index out of range while searching through FASTA file


I'm working on a program that lets the user enter a sequence they want to find inside a FASTA file, after which the program shows the description line and the sequence that belongs to it. The FASTA can be found at hugheslab.ccbr.utoronto.ca/supplementary-data/IRC/IRC_representative_cdna.fa.gz, it's approx. 87 MB.

The idea is to first create a list with the location of description lines, which always start with a >. Once you know what are the description lines, you can search for the search_term in the lines between two description lines. This is exactly what is done in the fourth paragraph, this results in a list of 48425 long, here is an idea of what the results are: http://imgur.com/Lxy8hnI

Now the fifth paragraph is meant to search between two description lines, let's take lines 0 and 15 as example, this would be description_list[a] and description_list[a+1] as a = 0 and a+1 = 1, and description_list[0] = 0 and description_list[1] = 15. Between these lines the if-statement searches for the search term, if it finds one it will save description_list[a] into the start_position_list and description_list[a+1] into the stop_position_list, which will be used later on.

So as you can imagine a simple term like 'ATCG' will occur often, which means the start_position_list and stop_position_list will have a lot of duplicates, which will be removed using list(set(start_position_list)) and afterwards sorting them. That way start_position_list[0] and start_position_list[0] will be 0 and 15, like this: https://i.sstatic.net/L9oek.jpg, which can then be used as a range for which lines to print out to show the sequence.

Now, of course, the big issue is that line 15, for i in range(description_list[a], description_list[a+1]): will eventually hit the [a+1] while it's already at the maximum length of description_list and therefore will give a list index out of range error, as you see here as well: https://i.sstatic.net/bvpeg.jpg

What would be the best solution for this ? It's still necessary to go through all the description lines and I can't come up with a better structure to go through them all ?

file = open("IRC_representative_cdna.fa")
file_list = list(file)

search_term = input("Enter your search term: ")

description_list = []
start_position_list = []
stop_position_list = []

for x in range (0, len(file_list)):
    if ">" in file_list[x]:
        description_list.append(x)

for a in range(0, len(description_list)):
        for i in range(description_list[a], description_list[a+1]):
            if search_term in file_list[i]:
                start_position_list.append(description_list[a])
                stop_position_list.append(description_list[a+1])

Solution

  • The way to avoid the subscript out of range error is to shorten the loop. Replace the line

    for a in range(0, len(description_list)):
    

    by

    for a in range(0, len(description_list)-1):
    

    Also, I think that you can use a list comprehension to build up description_list:

    description_list = [x for x in file_list if x.startswith('>')]
    

    in addition to being shorter it is more efficient since it doesn't do a linear search over the entire line when only the starting character is relevant.