Loop through every file with specific format in a directory using sys argv


I'd like to loop through every file in a directory given by the user and apply a specific transformation for every file that ends with ".fastq".

Basically this would be the pipeline:

  1. User puts the directory of where those files are (in command line)
  2. Script loops through every file that has the format ".fastq" and applies specific transformation
  3. Script saves new output in ".fasta" format

This is what I have (python and biopython):

import sys, os
from Bio import SeqIO
from Bio.SeqIO.QualityIO import FastqGeneralIterator
from pathlib import Path

path = Path(sys.argv[1])
print(path)

glob_path = path.glob('*')

for file_path in glob_path:
    if file_path.endswith(".fastq"):
        with open(glob_path, "rU") as input_fq:
            with open("{}.fasta".format(file_path),"w") as output_fa:
                for (title, sequence, quality) in FastqGeneralIterator(input_fq):
                    output_fa.write(">%s\n%s\n" \
                                    % (title, sequence))

if not os.path.exists(path): 
    raise Exception("No file at %s." % path)

The script runs, but it does not produce the output (it does not create the fasta files as desired). How can I make the script loop through the files in a specific directory and pass the full path of each file into the for loop, so that the content of input_fq is read and the transformed result is written to output_fa?


Solution

  • Your problem is with this line:

    with open(glob_path, "rU") as input_fq:
    

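    Remember that glob_path is a generator that yields every entry in the user-supplied directory, not a single file, so it cannot be passed to open(). You want to open file_path, which is the entry the loop is currently visiting. The "U" mode flag was deprecated and then removed in Python 3.11, so plain "r" is the safer choice:

    with open(file_path, "r") as input_fq: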
    

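    Also, file_path is a pathlib.Path object, and Path has no endswith method (that belongs to str), so the if test would raise an AttributeError for the first entry it sees rather than filter anything. You can drop that if statement entirely by globbing for the pattern "*.fastq":

    glob_path = path.glob('*.fastq')
    
    for file_path in glob_path:
        with open(file_path, "r") as input_fq:
            ...  # the rest of the loop body stays as in the question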
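
Putting the two fixes together (open file_path rather than glob_path, and glob for "*.fastq"), the whole conversion could look roughly like the sketch below. The function name convert_directory is hypothetical, and a few choices are assumptions rather than part of the original script: the output is named with with_suffix(".fasta") so that reads.fastq becomes reads.fasta instead of reads.fastq.fasta, the directory check runs before the loop instead of after it, and the file is opened in default text mode because the "U" flag no longer exists in Python 3.11 and later. The fallback reader is only there so the sketch runs without Biopython installed, and it assumes plain four-line FASTQ records.

```python
from pathlib import Path

try:
    from Bio.SeqIO.QualityIO import FastqGeneralIterator
except ImportError:
    # Fallback (assumption): a minimal stand-in for Biopython's iterator that
    # only handles unwrapped FASTQ records of exactly four lines each.
    def FastqGeneralIterator(handle):
        while True:
            header = handle.readline()
            if not header.strip():
                return  # end of file (or a trailing blank line)
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # skip the "+" separator line
            quality = handle.readline().rstrip("\n")
            yield header.rstrip("\n")[1:], sequence, quality


def convert_directory(directory):
    """Write a .fasta file next to every .fastq file in `directory`.

    Hypothetical helper; returns the list of FASTA paths it wrote.
    """
    directory = Path(directory)
    # Validate the argument up front, before any file is touched.
    if not directory.is_dir():
        raise NotADirectoryError("No directory at %s." % directory)

    written = []
    # The glob pattern does the filtering, so no endswith() test is needed.
    for fastq_path in sorted(directory.glob("*.fastq")):
        fasta_path = fastq_path.with_suffix(".fasta")
        # Open the individual file, not the glob generator.
        with open(fastq_path) as input_fq, open(fasta_path, "w") as output_fa:
            for title, sequence, _quality in FastqGeneralIterator(input_fq):
                output_fa.write(">%s\n%s\n" % (title, sequence))
        written.append(fasta_path)
    return written
```

In a command-line script, the directory would come from the first argument, for example convert_directory(sys.argv[1]) after import sys.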