About python loop

I have two path

path1 = "/home/x/nearline"
path2 = "/home/x/sge_jobs_output"

In path1, I have a bunch of fastq files:

ERR001268_1.recal.fastq.gz
ERR001268_2.recal.fastq.gz
ERR001269_1.recal.fastq.gz
ERR001269_2.recal.fastq.gz
.............

In path2, I have many .txt corresponding to the fastq files in path1:

ERR001268_1.txt
ERR001268_2.txt
ERR001269_1.txt
ERR001269_2.txt
.............

NOw I've made script to calculate fastq_seq_num from fastq files in path1, see below:

for file in os.listdir(path1):
  if re.match('.*\.recal.fastq.gz', file):
    fullpath1 = os.path.join(path1, file)
#To calculate the sequence number in fastq.gz files  
    result = commands.getoutput('zcat ' + fullpath1 + ' |wc -l')
    fastq_seq_num = int(result)/4.0
  print file,fastq_seq_num

And also calculate num_seq_processed_sai from .txt files in path2, see below:

for file in os.listdir(path2):
  if re.match('.*\.txt', file):
      fullpath2 = os.path.join(path2, file)
#To calculate how many sequences have been processed in .sai file
      linelist = open (fullpath2,'r').readlines
      lastline = linelist[len(linelist)-1]
      num_seq_processed_sai = lastline.split(']')[1].split()[0]
  print file,num_seq_processed_sai

OK, now my problem is : I want to create a loop, in which I calculate fastq_seq_num for the FIRST fastq file in path1; then calculate num_seq_processed for the FIRST txt file in path2; then compare this two numbers; then end the loop. Then the second loop starts...How can I desgin some loop to achieve this? thanks!!!

Solution

Iterate the fastq files, check for the corresponding .txt file and if the pair exists, run your processing and compare the outputs

import commands
import glob
from os import path

dir1 = '/home/x/nearline'
dir2 = '/home/x/sge_jobs_output'

for filepath in glob.glob(path.join(dir1, '*.recal.fastq.gz')):
    filename = path.basename(filepath)
    job_id = filename.split('.', 1)[0]

    ## Look for corresponding .txt file
    txt_filepath = path.join(dir2, '%s.txt' % job_id)
    ## Fail early if corresponding .txt file is missing
    if not path.isfile(txt_filepath):
        print('Missing %s for %s' % (txt_filepath, filepath))
        continue

    ## Both exist, process each
    ## This is from your code snippet
    result = commands.getoutput('zcat ' + fullpath1 + ' |wc -l')
    fastq_seq_num = int(result)/4.0

    linelist = open(txt_filepath).readlines()
    lastline = linelist[len(linelist)-1]
    num_seq_processed_sai = lastline.split(']')[1].split()[0]

    if fastq_seq_num == num_seq_processed_sai:
        print "Sequence numbers match (%d : %d)" % (fastq_seq_num, num_seq_processed_sai)
    else:
        print "Sequence numbers do not match (%d : %d)" % (fastq_seq_num, num_seq_processed_sai)

I suggest using glob.glob() to list your files (as I've done in this snippet).

I also suggest replacing commands with subprocess, it's newer and I think commands is deprecated.

Also, reading all lines of a .txt file to get at the last line is inefficient and may cause memory issues if the files are large. Consider calling tail to get the last line (*nix only I think).