Search code examples
pythonloopsfastq

About python loop


I have two path

path1 = "/home/x/nearline"
path2 = "/home/x/sge_jobs_output"

In path1, I have a bunch of fastq files:

ERR001268_1.recal.fastq.gz
ERR001268_2.recal.fastq.gz
ERR001269_1.recal.fastq.gz
ERR001269_2.recal.fastq.gz
.............

In path2, I have many .txt corresponding to the fastq files in path1:

ERR001268_1.txt
ERR001268_2.txt
ERR001269_1.txt
ERR001269_2.txt
.............

NOw I've made script to calculate fastq_seq_num from fastq files in path1, see below:

for file in os.listdir(path1):
  if re.match('.*\.recal.fastq.gz', file):
    fullpath1 = os.path.join(path1, file)
#To calculate the sequence number in fastq.gz files  
    result = commands.getoutput('zcat ' + fullpath1 + ' |wc -l')
    fastq_seq_num = int(result)/4.0
  print file,fastq_seq_num 

And also calculate num_seq_processed_sai from .txt files in path2, see below:

for file in os.listdir(path2):
  if re.match('.*\.txt', file):
      fullpath2 = os.path.join(path2, file)
#To calculate how many sequences have been processed in .sai file
      linelist = open (fullpath2,'r').readlines
      lastline = linelist[len(linelist)-1]
      num_seq_processed_sai = lastline.split(']')[1].split()[0]
  print file,num_seq_processed_sai

OK, now my problem is : I want to create a loop, in which I calculate fastq_seq_num for the FIRST fastq file in path1; then calculate num_seq_processed for the FIRST txt file in path2; then compare this two numbers; then end the loop. Then the second loop starts...How can I desgin some loop to achieve this? thanks!!!


Solution

  • Iterate the fastq files, check for the corresponding .txt file and if the pair exists, run your processing and compare the outputs

    import commands
    import glob
    from os import path
    
    dir1 = '/home/x/nearline'
    dir2 = '/home/x/sge_jobs_output'
    
    for filepath in glob.glob(path.join(dir1, '*.recal.fastq.gz')):
        filename = path.basename(filepath)
        job_id = filename.split('.', 1)[0]
    
        ## Look for corresponding .txt file
        txt_filepath = path.join(dir2, '%s.txt' % job_id)
        ## Fail early if corresponding .txt file is missing
        if not path.isfile(txt_filepath):
            print('Missing %s for %s' % (txt_filepath, filepath))
            continue
    
        ## Both exist, process each
        ## This is from your code snippet
        result = commands.getoutput('zcat ' + fullpath1 + ' |wc -l')
        fastq_seq_num = int(result)/4.0
    
        linelist = open(txt_filepath).readlines()
        lastline = linelist[len(linelist)-1]
        num_seq_processed_sai = lastline.split(']')[1].split()[0]
    
        if fastq_seq_num == num_seq_processed_sai:
            print "Sequence numbers match (%d : %d)" % (fastq_seq_num, num_seq_processed_sai)
        else:
            print "Sequence numbers do not match (%d : %d)" % (fastq_seq_num, num_seq_processed_sai)
    

    I suggest using glob.glob() to list your files (as I've done in this snippet).

    I also suggest replacing commands with subprocess, it's newer and I think commands is deprecated.

    Also, reading all lines of a .txt file to get at the last line is inefficient and may cause memory issues if the files are large. Consider calling tail to get the last line (*nix only I think).