Search code examples
pythonbioinformaticshierarchical-clusteringdendrogramgenome

Create a Dendogram from Genome


I wanted to play around with genomic data:

Species_A = ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag
Species_B = ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag
Species_C = ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag
Species_D = ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg
Species_E = ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag

I wanted to create a dendrogram based on how close these organisms are related to each other given the genome sequence above. What I did first was to count the number of a's, c's, t's and g's of each species then I created an array, then plotted a dendrogram:

gen_size1 = len(Species_A)
a1 = float(Species_A.count('a'))/float(gen_size1)
c1 = float(Species_A.count('c'))/float(gen_size1)
g1 = float(Species_A.count('g'))/float(gen_size1)
t1 = float(Species_A.count('t'))/float(gen_size1)
.
.
.
gen_size5 = len(Species_E)
a5 = float(Species_E.count('a'))/float(gen_size5)
c5 = float(Species_E.count('c'))/float(gen_size5)
g5 = float(Species_E.count('g'))/float(gen_size5)
t5 = float(Species_E.count('t'))/float(gen_size5)

my_genes = np.array([[a1,c1,g1,t1],[a2,c2,g2,t2],[a3,c3,g3,t3],[a4,c4,g4,t4],[a5,c5,g5,t5]])
plt.subplot(1,2,1)
plt.title("Mononucleotide")
linkage_matrix = linkage(my_genes, "single")
print linkage_matrix
dendrogram(linkage_matrix,truncate_mode='lastp', color_threshold=1, labels=[Species_A, Species_B, Species_C, Species_D, Species_E], show_leaf_counts=True)
plt.show()

Species A and B are variants of the same organism and I am expecting that both should diverge from a common clade form the root, same goes with Species C and D which should diverge from another common clade from the root then with Species E diverging from the main root because it is not related to Species A to D. Unfortunately the dendrogram result was mixed up with Species A and E diverging from a common clade, then Species C, D and B in another clade (pretty messed up).

I have read about hierarchical clustering for genome sequence but I have observed that it only accommodates 2 dimensional system, unfortunately I have 4 dimensions which are a,c,t and g. Any other strategy for this? thanks for the help!


Solution

  • This is a fairly common problem in bioinformatics, so you should use a bioinformatics library like BioPython that has this functionality builtin.

    First you create a multi FASTA file with your sequences:

    import os
    from Bio import SeqIO
    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord
    from Bio.Alphabet import generic_dna
    
    
    sequences = ['ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag',
                 'ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag',
                 'ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag',
                 'ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg',
                 'ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag']
    
    my_records = [SeqRecord(Seq(sequence, generic_dna),
                  id='Species_{}'.format(letter), description='Species_{}'.format(letter))
                  for sequence, letter in zip(sequences, 'ABCDE')]
    
    root_dir = r"C:\Users\BioGeek\Documents\temp"
    filename = 'my_sequences'
    fasta_path = os.path.join(root_dir, '{}.fasta'.format(filename))
    
    SeqIO.write(my_records, fasta_path, "fasta")
    

    This creates the file C:\Users\BioGeek\Documents\temp\my_sequences.fasta that looks like this:

    >Species_A
    ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag
    >Species_B
    ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag
    >Species_C
    ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag
    >Species_D
    ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg
    >Species_E
    ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag
    

    Next, use the command line tool ClustalW to do a multiple sequence alignment:

    from Bio.Align.Applications import ClustalwCommandline
    clustalw_exe = r"C:\path\to\clustalw-2.1\clustalw2.exe"
    assert os.path.isfile(clustalw_exe), "Clustal W executable missing"
    clustalw_cline = ClustalwCommandline(clustalw_exe, infile=fasta_path)
    stdout, stderr = clustalw_cline()    
    print stdout
    

    This prints:

     CLUSTAL 2.1 Multiple Sequence Alignments
    
    
    Sequence format is Pearson
    Sequence 1: Species_A     50 bp
    Sequence 2: Species_B     50 bp
    Sequence 3: Species_C     50 bp
    Sequence 4: Species_D     50 bp
    Sequence 5: Species_E     50 bp
    Start of Pairwise alignments
    Aligning...
    
    Sequences (1:2) Aligned. Score:  90
    Sequences (1:3) Aligned. Score:  94
    Sequences (1:4) Aligned. Score:  88
    Sequences (1:5) Aligned. Score:  84
    Sequences (2:3) Aligned. Score:  90
    Sequences (2:4) Aligned. Score:  84
    Sequences (2:5) Aligned. Score:  78
    Sequences (3:4) Aligned. Score:  94
    Sequences (3:5) Aligned. Score:  82
    Sequences (4:5) Aligned. Score:  82
    Guide tree file created:   [C:\Users\BioGeek\Documents\temp\my_sequences.dnd]
    
    There are 4 groups
    Start of Multiple Alignment
    
    Aligning...
    Group 1: Sequences:   2      Score:912
    Group 2: Sequences:   2      Score:921
    Group 3: Sequences:   4      Score:865
    Group 4: Sequences:   5      Score:855
    Alignment Score 2975
    
    CLUSTAL-Alignment file created  [C:\Users\BioGeek\Documents\temp\my_sequences.aln]
    

    The my_sequences.dnd file ClustalW creates, is a standard Newick tree file and Bio.Phylo can parse these:

    from Bio import Phylo
    newick_path = os.path.join(root_dir, '{}.dnd'.format(filename))
    tree = Phylo.read(newick_path, "newick")
    Phylo.draw_ascii(tree)
    

    Which prints:

           ____________ Species_A
      ____|
     |    |_____________________________________ Species_B
     |
    _|          ____ Species_C
     |_________|
     |         |_________________________ Species_D
     |
     |__________________________________________________________________ Species_E
    

    Or, if you have matplotlib or pylab installed, you can create a graphic using the draw function:

    tree.rooted = True
    Phylo.draw(tree, branch_labels=lambda c: c.branch_length)
    

    which produces:

    dendrogram

    This dendrogram clearly illustrates what you observed: that species A and B are variants of the same organism and both diverge from a common clade from the root. Same goes with Species C and D, both diverge from another common clade from the root. Finally, Species E diverges from the main root because it is not related to Species A to D.