Search code examples
pythonsamtools

My merge overlapping coordinate python script cost too much memory


I'm trying to choose unique regions for my Non-invasive Prenatal Testing (NIPT) project. I have done following steps:

  • Create an artificial fasta file contains 50bp sequence. On each chromosome, the next sequence overlap 40bp from previous sequence
  • Align and only chosen no mismatch sequence

I have a .sam file about 40gb, on the next step I try to merge all overlapping coordinates to one .bed file for using in samtools. This is my python script to do that:

import glob
import time

folder = glob.glob('analysys/*.sam')
core = 22
nst = [f'chr{x+1}' for x in range(22)] + ['chrX','chrY']
for file in folder:
    now = time.time()
    print(f'Analyzing {file}')
    raw = []
    treat1 = []
    treat2 = []
    with open(file) as samfile:
        print('Importing coordinates')
        for line in samfile:
            coord_raw = line.split('\t')[0]
            coord_chr = coord_raw.split('_')[0]
            coord_loc = [int(r) for r in coord_raw.split('_')[1].split('-')] #0: begin, 1: end
            raw.append(({coord_chr},{coord_loc[0]},{coord_loc[1]}))
        print('Begin merging overlapping area...')
        last_coord = () #0:chr, 1:begin, 2:end
        for chrom ,begin , end in raw:
            if last_coord == () or last_coord[0] != chrom:
                last_coord = (chrom,begin,end)
            else:
                if last_coord[0] == chrom:
                    if begin > last_coord[1] and end < last_coord[2]:
                        last_coord = (chrom,last_coord[1],end)
                    else:
                        treat1.append(({last_coord[0]},{last_coord[1]},{last_coord[2]}))
                        last_coord = (chrom,begin,end)
                else:
                    treat1.append(({last_coord[0]},{last_coord[1]},{last_coord[2]}))
                    last_coord = (chrom,begin,end)       
        print('Begin merging nearby area...')                         
        last_coord = ()
        for chrom ,begin , end in treat1:
            if last_coord == ():
                last_coord = (chrom,begin,end) 
            else:
                if chrom == last_coord[0]:
                    if begin == last_coord[2] + 1:
                        last_coord = (last_coord[0],last_coord[1],end)
                    else:
                        treat2.append(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}')
                        last_coord = (chrom,begin,end)
                else:
                    treat2.append(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}')
                    last_coord = (chrom,begin,end)
print('Outputting...')
with open('unique_coord.bed','w') as bedfile:
    bedfile.write('\n'.join(treat2))
print(f'Program completed! Total time: {int(time.time() - now)} seconds')

However, after 30 minutes running, I found that the program consumed all of my memory and crashed. Is there any advice for me to reduce the amount of memory this script consumed? thank you very much


Solution

  • Hopefully this does the trick. Hard to know since I'm working without having the input file or knowing what the output should look like and can't really run the code to check for errors, but here's what I've tried to do:

    1. Eliminate raw and go straight to creating treat1; and
    2. Eliminate treat2 and go straight to writing to bedfile. I opened bedfile right off the bat and everything in your program is done with that file open.

    If this does what you need but still crashes, then perhaps you can read through each file twice to get last_coord at the end of the treat1 and then read through it again to "recreate" each line of treat1 individually and apply it to defining what needs to be written into the file.

    Without really knowing the details of what you're doing (I do not work anywhere close to a field that would apply samtools).

    import glob
    import time
    
    folder = glob.glob('analysys/*.sam')
    
    # These two lines are not used
    #core = 22
    #nst = [f'chr{x+1}' for x in range(22)] + ['chrX','chrY']
    
    # moved this to outside the for loop since the time check was 
    # after the for loop
    now = time.time() 
    
    with open('unique_coord.bed','w') as bedfile:
        for file in folder:
            print(f'Analyzing {file}')
            raw = []
            treat1 = []
            with open(file) as samfile:
                print('Importing coordinates')
                last_coord = ()
                for line in samfile:
                    chrom = line.split('\t')[0]
                    begin = coord_raw.split('_')[0]
                    end = [int(r) for r in coord_raw.split('_')[1].split('-')] #0: begin, 1: end
                    if last_coord == () or last_coord[0] != chrom:
                        last_coord = (chrom,begin,end)
                    else:
                        if last_coord[0] == chrom:
                            if begin > last_coord[1] and end < last_coord[2]:
                                last_coord = (chrom,last_coord[1],end)
                            else:
                                treat1.append(({last_coord[0]},{last_coord[1]},{last_coord[2]}))
                                last_coord = (chrom,begin,end)
                        else:
                            treat1.append(({last_coord[0]},{last_coord[1]},{last_coord[2]}))
                            last_coord = (chrom,begin,end)       
                print('Begin merging nearby area...')                         
                last_coord = ()
                for chrom ,begin , end in treat1:
                    if last_coord == ():
                        last_coord = (chrom,begin,end) 
                    else:
                        if chrom == last_coord[0]:
                            if begin == last_coord[2] + 1:
                                last_coord = (last_coord[0],last_coord[1],end)
                            else:
                                bedfile.write(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}\n')
                                last_coord = (chrom,begin,end)
                        else:
                            bedfile.write(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}\n')
                            last_coord = (chrom,begin,end)
    
    print(f'Program completed! Total time: {int(time.time() - now)} seconds')