Search code examples
pythonstringsearchblast

Use python to reduce number of hits in tabular blast output file


I have a large blast files in tabular format, for which the number of target sequences weren't limited, so it takes a long time to parse. I want to reduce the number of hits for each query sequence to the first 10. My python is basic but here's what I have so far

import sys

blastfile = open(sys.argv[1],"r")

column1list=[]

for line in blastfile:
    b = line.split()[0]
    column1list.append(b)

uniqcolumn1 = list(set(column1list))

counter = 0

for val in uniqcolumn1:
    #print val
    for line in blastfile:
        #print line
        while counter <= 10:
            if line.startswith(val):
                print line
                counter =+ 1

Here's an example of one line of the blast output file, the title of the query sequence is in the first column, in this case 'c8208_g1_i2'

c8208_g1_i2 gi|851252702|ref|WP_048131971.1|    79.30   797 165 0   4881    2491    1   797 0.0 1336    acetyl-CoA decarbonylase/synthase complex subunit alpha [Methanosaeta concilii]

I think the first part of the code is working OK, up to ' uniqcolumn1 = list(set(column1list))', then I can't get it to print the first ten lines starting with each string in the list.


Solution

  • The problem here seems to be that you are iterating through your file object twice. In Python, file objects work a lot like pointers that read through every line. If you do not move the pointer back, it has nothing to read.

    What you would have to do is use the .seek function to move this pointer back to the start. For example, let's say you have a file_to_read.txt and your python_script.py.

    file_to_read.txt

    Hello! My name is Bob and I can't think of anything to
    put in this file so I'm blabbering on about nonsense
    in hopes that you won't realise that this text is not
    important but the code in the actually file, though I
    think that you wouldn't mind reading this long file.
    

    python_script.py

    f = open("file_to_read.txt", "r")
    for line in f: print line
    for line in f: print line
    

    If you were to run this code (and no errors about directories occur) you would only have the file_to_read.txt printed once. To solve this, you can just add a f.seek(0, 0) in between reading. For example:

    f = open("file_to_read.txt", "r")
    for line in f: print line
    f.seek(0, 0)
    for lien in f: print line
    

    Now, back to your context, you can see how this applies to your code:

    import sys
    # Here is your reading of file
    blastfile = open(sys.argv[1],"r")
    column1list = []
    # Here is the first time you read the file
    for line in blastfile:
        b = line.split()[0]
        column1list.append(b)
    # Add a line to move back to the start before the
    # next reading
    blastfile.seek(0, 0)
    
    uniqcolumn1 = list(set(column1list))
    
    for val in uniqcolumn1:
        # Move the counter inside to refresh it after every iteration
        counter = 0
        # Here is the second time you read your file
        for line in blastfile:
            while counter <= 10:
                if line.startswith(val):
                    print line
                    counter += 1
        # Since you are going to read the file the next iteration,
        # .seek the file
        blastfile.seek(0, 0)
    

    EDIT

    Here is the second half of the code, fixed. You can do it either:

    for val in uniqcolumn1:
        # Move the counter in
        counter = 0
        # Move the while loop out
        while counter <= 10:
            for line in blastfile:
                if line.startswith(val):
                    print line,
                    counter += 1
        blastfile.seek(0, 0)
    

    The benefit of this is that the for loop terminates earlier, it does not read through the whole file.

    The other way is to use this:

    for val in uniqcolumn1:
        # Move counter in
        counter = 0
        # Remove while statement
        for line in blastfile:
            # Add additional condition to if statement
            if line.startswith(val) and counter <= 10:
                print line,
                counter += 1
            elif counter > 10:
                break
        blastfile.seek(0, 0)
    

    The benefit of this is that it looks simpler.