Search code examples
pythondataframebiopythonphylogeny

Biopython gives ValueError: Sequences must all be the same length even though sequences are of the same length


I'm trying to create a phylogenetic tree by making a .phy file from my data.

I have a dataframe

ndf=

ESV trunc

1 esv1 TACGTAGGTG...

2 esv2 TACGGAGGGT...

3 esv3 TACGGGGGG...

7 esv7 TACGTAGGGT...

I checked the length of the elements of the column "trunc":

length_checker = np.vectorize(len)

arr_len = length_checker(ndf['trunc'])

The resulting arr_len gives the same length (=253) for all the elements.

I saved this dataframe as .phy file, which looks like this:

23 253

  1. esv1 TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGGAAACTTGAGTGCAGAAGAGGAAAGCGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGGCTTTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG

  2. esv2 TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG

  3. esv3 TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCCAGACCAAGTCGAGTGTGAAATTGCAGGGCTTAACTTTGCAGGGTCGCTCGATACTGGTCGGCTAGAGTGTGGAAGAGGGTACTGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAGGAACACCAGCGGCGAAGGCGGGTACCTGGGCCAACACTGACGCTGAGGCGCGAAAGCTAGGGGAGCAAACAG

This is similar to the file used in this tutorial.

However, when I run the command aln = AlignIO.read('msa.phy', 'phylip')

I get "ValueError: Sequences must all be the same length"

I don't know why I'm getting this or how to fix it. Any help is greatly appreciated!

Thanks


Solution

  • First, please read the answer to How to make good reproducible pandas examples. In the future please provide a minimal reproducibl example.

    Secondly, Michael G is absolutely correct that phylip is a format that is very peculiar about its syntax.

    The code below will alow you to generate a phylogenetic tree from your Pandas dataframe.

    First some imports and let's recreate your dataframe.

    import pandas as pd
    from Bio import Phylo
    from Bio.Phylo.TreeConstruction import DistanceCalculator
    from Bio.Phylo.TreeConstruction import DistanceTreeConstructor
    from Bio import AlignIO
    
    data = {'ESV' : ['esv1', 'esv2', 'esv3'],
            'trunc': ['TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGGAAACTTGAGTGCAGAAGAGGAAAGCGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGGCTTTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG',
                     'TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG',
                     'TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCCAGACCAAGTCGAGTGTGAAATTGCAGGGCTTAACTTTGCAGGGTCGCTCGATACTGGTCGGCTAGAGTGTGGAAGAGGGTACTGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAGGAACACCAGCGGCGAAGGCGGGTACCTGGGCCAACACTGACGCTGAGGCGCGAAAGCTAGGGGAGCAAACAG']
    
    }
    
    ndf = pd.DataFrame.from_dict(data)
    print(ndf)
    

    Output:

        ESV                                              trunc
    0  esv1  TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCGCG...
    1  esv2  TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTG...
    2  esv3  TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCG...
    

    Next, write the phylip file in the correct format.

    with open("test.phy", 'w') as f:
        f.write("{:10} {}\n".format(ndf.shape[0], ndf.trunc.str.len()[0]))
        for row in ndf.iterrows():
            f.write("{:10} {}\n".format(*row[1].to_list()))
    

    Ouput of test.phy:

             3 253
    esv1       TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGGAAACTTGAGTGCAGAAGAGGAAAGCGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGGCTTTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
    esv2       TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG
    esv3       TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCCAGACCAAGTCGAGTGTGAAATTGCAGGGCTTAACTTTGCAGGGTCGCTCGATACTGGTCGGCTAGAGTGTGGAAGAGGGTACTGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAGGAACACCAGCGGCGAAGGCGGGTACCTGGGCCAACACTGACGCTGAGGCGCGAAAGCTAGGGGAGCAAACAG
    

    Now we can start with the creation of our phylogenetic tree.

    # Read the sequences and align
    aln = AlignIO.read('test.phy', 'phylip')
    print(aln)
    

    Output:

    SingleLetterAlphabet() alignment with 3 rows and 253 columns
    TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCG...AGG esv1
    TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGG...AGG esv2
    TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGG...CAG esv3
    

    Calculate the distance matrix:

    calculator = DistanceCalculator('identity')
    dm = calculator.get_distance(aln)
    print(dm)
    

    Output:

    esv1    0
    esv2    0.3003952569169961  0
    esv3    0.6086956521739131  0.6245059288537549  0
    

    Construct the phylogenetic tree using UPGMA algorithm and draw the tree in ascii

    constructor = DistanceTreeConstructor()
    tree = constructor.upgma(dm)
    Phylo.draw_ascii(tree)
    

    Output:

      ________________________________________________________________________ esv3
    _|
     |                                     ___________________________________ esv2
     |____________________________________|
                                          |___________________________________ esv1
    

    Or make a nice plot of the tree:

    Phylo.draw(tree)
    

    Output:

    enter image description here