I'm new to bioinformatics and would really aprecciated some help!
I have a big multi-fasta file (genes.faa), like this:
>gene1_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene2_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene3_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene4_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
(...)
And a list of gene pairs (gene.pairs.txt), with two genes per line separeted by a tab:
gene13_A \t gene33_B
gene2_A \t gene48_B
gene56_A \t gene2_B
And I needed a way to read the list of gene pairs and create a fasta file for each line of the list of gene pairs. So, in this case, I would have 3 fasta files (the name of the output fasta files is not important), like this:
fasta1
>gene13_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene33_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
fasta2
>gene2_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene48_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
fasta3
>gene56_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene2_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
I tried to write a script in python but I couldn't find a way to read the list in a loop and write a fasta file for each line. Thank you so much in advance for any help!
This code works. I have tested on the following files:
Input FASTA file:
>gene1_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPA1
>gene2_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPA2
>gene3_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPA3
>gene4_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPA4
Gene List file:
gene1_A gene3_B
gene2_A gene4_B
Code:
from Bio import SeqIO
my_gene_file = open('genelist.txt','r')
my_fasta_file = open('input.fasta','r')
my_list=[]
for line in my_gene_file:
line = line.strip(' \n')
gene_id1, gene_id2 = line.split('\t')
gene_tuple = (gene_id1, gene_id2)
my_list.append(gene_tuple)
my_dict={}
for seq_record in SeqIO.parse(my_fasta_file, "fasta"):
my_dict[seq_record.id] = seq_record.seq
for item in my_list:
if item[0] in my_dict and item[1] in my_dict:
output_file_name=(f'{item[0]}_{item[1]}.fasta')
f = open(output_file_name, "w")
f.write(f'>{item[0]}\n{my_dict[item[0]]}\n>{item[1]}\n{my_dict[item[1]]}')
It outputs files with fasta record ids in the filenames.