Tags: python, python-3.x, text, fasta

Writing a file using two different texts as input in python


I have two big .fasta files that look like this:

File 1:

>A01
ABCDGENG
>A02
JALSKDLAS
#and so on

File 2:

>KJ01
KGLW
>XB02
CTRIPIO
#and so on

I want to generate an individual file for each pair of entries (both files have the same number of entries), i.e. the first output would look like this:

>A01
ABCDGENG
>KJ01
KGLW

As a detail, each output file is named after its entry from the first file, so the example above would be called A01.fasta.

I already have a script that works great for when I have only one big file, but I need some guidance on how to add the second part to each individual file. Here's the script:

infile = open("D:/path_to_file")

path = "D:/path_to_project" # Add your project directory in here

opened = False # Assume outfile is not open
i=0
for line_ref in infile:
    if line_ref[0] == ">": # If line begins with ">"
        i = i + 1 # Counter, in case there are entries with the same name
        if opened:
            outfile.close() # Close the previous outfile if it is open
        opened = True # Set opened to True to represent an opened outfile
        contig_name = line_ref[1:].rstrip() # Extract contig name: drop the leading ">" and any trailing whitespace or newline
        print("contig: " + contig_name)
        outfile=open(path + "/" + str(contig_name) +"-"+ str(i)+ ".fasta", 'w')
    outfile.write(line_ref)    
outfile.close()
print("Fin")

But I don't know how to also go through the lines of the other file (File 2) and add them under the first one, without closing the file and opening it again. Thanks in advance!


Solution

  • Doing this with plain file operations gets complicated, since a FASTA sequence can span a variable number of lines. It's best to parse the files with a library such as pyfastx and iterate over both files in parallel:

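    import pyfastx
    
    # Note: newer pyfastx releases may yield (name, seq) pairs unless
    # comment=True is passed to Fastx; adjust the unpacking to your version.
    fa1 = pyfastx.Fastx('file1.fasta')
    fa2 = pyfastx.Fastx('file2.fasta')
    
    for index, ((name1, seq1, comment1), (name2, seq2, comment2)) in enumerate(zip(fa1, fa2), 1):
        # Named like the question's script (name-index); drop "-{index}" to get A01.fasta
        with open(f"{name1}-{index}.fasta", "w") as out:
            out.write(f">{name1}\n{seq1}\n>{name2}\n{seq2}\n")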
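
If installing a library is not an option, the same pairing can be sketched with the standard library only. This is a minimal sketch, not a full FASTA parser: the helper names read_fasta and write_pairs are made up for illustration, it assumes both files list their entries in matching order, and it assumes the first word of each file-1 header is safe to use as a file name. Multi-line sequences are joined into a single line.

```python
from itertools import zip_longest
from pathlib import Path


def read_fasta(path):
    """Yield (name, sequence) pairs; multi-line sequences are joined."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:], []
            elif name is not None:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)  # last record in the file


def write_pairs(file1, file2, out_dir):
    """Write one FASTA file per pair of entries, named after the file-1 entry."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    seen = {}      # how often each output name has been used so far
    written = []
    for rec1, rec2 in zip_longest(read_fasta(file1), read_fasta(file2)):
        if rec1 is None or rec2 is None:
            raise ValueError("the two files have different numbers of entries")
        (name1, seq1), (name2, seq2) = rec1, rec2
        # Assumption: the first word of the header is a valid file name.
        stem = (name1.split() or ["entry"])[0]
        count = seen[stem] = seen.get(stem, 0) + 1
        # Only add a numeric suffix when a name repeats, to avoid overwriting.
        filename = f"{stem}.fasta" if count == 1 else f"{stem}-{count}.fasta"
        out_path = out_dir / filename
        out_path.write_text(f">{name1}\n{seq1}\n>{name2}\n{seq2}\n")
        written.append(out_path)
    return written
```

Calling write_pairs("file1.fasta", "file2.fasta", "D:/path_to_project") would then produce A01.fasta, A02.fasta and so on in that directory. Using zip_longest rather than zip makes a mismatch in the number of entries raise an error instead of silently dropping the extra records.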