Search code examples
bioinformaticsbiopythonfastabioconductor

extract/parse large multifasta into alignments using table (csv, tsv)


I often need to parse a large multifasta into individual multifastas for downstream alignments using a table generated from some another program/code.

I have a large multifasta (seq.fa):

>sp1_gene1
ATTAC
>sp1_gene2
CCATTA
...
>sp2_gene1
ATTAC
>sp1_gene2
TCGAGT

And I have a tsv file with the locus name in the first column, and the list of headers in subsequent columns. The number of fields in each row may not be equal because one species may not have it. But I can easily add headers for each species and put NA or something similar for missing data. The table (genes.tsv):

geneA    sp1_gene3    sp2_gene1
geneB    sp1_gene5    sp2_gene7
...

I want use the genes table to create individual multifastas (ideally with the name from the first column) with the headers and sequences to get something like this:

cat geneA.fa
>sp1_gene3
ATTAC
>sp2_gene1
ATTAC
...
cat geneB.fa
>sp1_gene5
TCGAGT
>sp2_gene7
ATTAC
...

I am familiar with bash (awk, grep, sed), and still learning R and python for bioinformatics. I was originally splitting up the table into individual files in bash, converting the fasta into a csv, then grepping and joining but it is really messy and doesn't always work. Any advice on scripts or packages that can do this? Thank you!


Solution

  • This solves your problem i think:

    sequences = {}
    
    with open("seq.fa") as my_fasta:
        for header in my_fasta:
            seq = next(my_fasta)
            sequences[header[1:].rstrip()] = seq.rstrip()
    
    with open("genes.tsv") as my_tsv:
        for line in my_tsv:
            splitted_line = line.split()
            gene_writer = open("/your/output/Dir/" + splitted_line[0] + ".fa", "w")
            for gene in splitted_line[1:]:
                if gene in sequences:
                    gene_writer.write(">" + gene + "\n")
                    gene_writer.write(sequences[gene] + "\n")
                else:
                    print(gene, "in tsv file but not in fasta")
            gene_writer.close()
    

    Breaking it down:

    sequences = {}
    
    with open("seq.fa") as my_fasta:
        for header in my_fasta:
            seq = next(my_fasta)
            sequences[header[1:].rstrip()] = seq.rstrip()
    

    this will create a Dictionary sequences with key the gene names, and value the sequence. Like so:

    {'sp1_gene1': 'ATTAC', 'sp1_gene2': 'TCGAGT', 'sp2_gene1': 'ATTAC'}
    

    The second part of the code iterates over the TSV file and for every line make a make a new .fa file and add the sequences in fasta format to that file.

    Hope this helps. :)