I have two big .fasta files that look like this:
File 1:
>A01
ABCDGENG
>A02
JALSKDLAS
#and so on
File 2:
>KJ01
KGLW
>XB02
CTRIPIO
#and so on
I want to generate an individual file for each pair of entries (both files have the same length), so, for example, the first output would look like this:
>A01
ABCDGENG
>KJ01
KGLW
One detail: each output file is named after the entry from the first file, so the example above would be called A01.fasta.
I already have a script that works great for when I have only one big file, but I need some guidance on how to add the second part to each individual file. Here's the script:
import os
from os import path
import sys
infile=open("D:/path_to_file")
os.system("D:/path_to_project") # Add your project directory in here
path = "D:/path_to_project"
opened = False # Assume outfile is not open
i=0
for line_ref in infile:
    if line_ref[0] == ">": # If line begins with ">"
        i =i+1 #in case that there are files with the same name
        if(opened):
            outfile.close() # Will close the outfile if it is open
        opened = True # Set opened to True to represent an opened outfile
        contig_name = line_ref[1:].rstrip() #Extract contig name: remove ">", extract contig string, remove any spaces or newlines following it
        print("contig: " + contig_name)
        outfile=open(path + "/" + str(contig_name) +"-"+ str(i)+ ".fasta", 'w')
    outfile.write(line_ref)
outfile.close()
print("Fin")
But I don't know how to also go through the lines of the other file (File 2) and add them under the first one, without closing the file and opening it again.
Thanks in advance!
Doing this with plain file operations gets complicated, since a FASTA sequence can span a variable number of lines. It's best to use a library to parse the files, such as pyfastx:
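import pyfastx
fa1 = pyfastx.Fastx('file1.fasta')
fa2 = pyfastx.Fastx('file2.fasta')
# Fastx is assumed here to yield (name, seq, comment) tuples; some pyfastx
# versions yield (name, seq) unless comment=True is passed, so adjust the
# unpacking to match your version.
for index, ((name1, seq1, comment1), (name2, seq2, comment2)) in enumerate(zip(fa1, fa2), 1):
    with open(f"outfile{index}.fasta", "w") as out:
        # Each record needs its own ">" header line
        out.write(f">{name1}\n{seq1}\n>{name2}\n{seq2}\n")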
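If installing a library is not an option, the same pairing can be sketched with the standard library only. This is a minimal sketch, not a full FASTA parser: `read_fasta` and `write_pairs` are hypothetical helper names, and it assumes every header is a plain ID that is also a valid file name (like `A01`), since each output file is named after the entry from the first file.

```python
from pathlib import Path


def read_fasta(path):
    """Yield (name, sequence) pairs, joining sequences that span several lines."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new header ends the previous record, if there was one
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:], []
            elif name is not None:
                chunks.append(line)
    # Emit the last record in the file
    if name is not None:
        yield name, "".join(chunks)


def write_pairs(file1, file2, out_dir):
    """Write one FASTA file per pair of records, named after the record from file1."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # zip() walks both files in step; it stops at the end of the shorter one
    for (name1, seq1), (name2, seq2) in zip(read_fasta(file1), read_fasta(file2)):
        # Assumption: name1 is usable as a file name; duplicate names overwrite
        out_path = out_dir / f"{name1}.fasta"
        with open(out_path, "w") as out:
            out.write(f">{name1}\n{seq1}\n>{name2}\n{seq2}\n")
        written.append(out_path)
    return written
```

Calling `write_pairs("file1.fasta", "file2.fasta", "D:/path_to_project")` would then produce A01.fasta, A02.fasta, and so on. If the first file can contain repeated names, appending a counter to the file name (as the script in the question does) avoids overwriting earlier output.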