Search code examples
listbioinformaticsbiopythonfasta

How to read a list of gene pairs and write a fasta file for each line


I'm new to bioinformatics and would really aprecciated some help!

I have a big multi-fasta file (genes.faa), like this:

>gene1_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene2_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene3_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene4_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
(...)

And a list of gene pairs (gene.pairs.txt), with two genes per line separeted by a tab:

gene13_A \t gene33_B
gene2_A \t gene48_B
gene56_A \t gene2_B

And I needed a way to read the list of gene pairs and create a fasta file for each line of the list of gene pairs. So, in this case, I would have 3 fasta files (the name of the output fasta files is not important), like this:

fasta1

>gene13_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene33_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC

fasta2

>gene2_A 
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene48_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC

fasta3

>gene56_A 
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene2_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC

I tried to write a script in python but I couldn't find a way to read the list in a loop and write a fasta file for each line. Thank you so much in advance for any help!


Solution

  • This code works. I have tested on the following files:

    Input FASTA file:

    >gene1_A
    MCTGTRNKIIRTCDNCRKRKIKCDRKRPA1
    >gene2_A
    MCTGTRNKIIRTCDNCRKRKIKCDRKRPA2
    >gene3_B
    MCTGTRNKIIRTCDNCRKRKIKCDRKRPA3
    >gene4_B
    MCTGTRNKIIRTCDNCRKRKIKCDRKRPA4
    

    Gene List file:

    gene1_A gene3_B
    gene2_A gene4_B
    

    Code:

    from Bio import SeqIO
    
    my_gene_file   = open('genelist.txt','r')
    my_fasta_file  = open('input.fasta','r')
    
    my_list=[]
    for line in my_gene_file:
        line = line.strip(' \n')
        gene_id1, gene_id2 = line.split('\t')   
        gene_tuple = (gene_id1, gene_id2)
        my_list.append(gene_tuple)
    
    my_dict={}
    for seq_record in SeqIO.parse(my_fasta_file, "fasta"):   
        my_dict[seq_record.id] = seq_record.seq
    
    for item in my_list:
        if item[0] in my_dict and item[1] in my_dict:
            output_file_name=(f'{item[0]}_{item[1]}.fasta')
            f = open(output_file_name, "w")
            f.write(f'>{item[0]}\n{my_dict[item[0]]}\n>{item[1]}\n{my_dict[item[1]]}')
    

    It outputs files with fasta record ids in the filenames.