python, python-3.x, phylogeny

Count the number of groups (with a specific TAG) within a specific format (with Python)


Hello everyone, I need some help:

I do not know if you are familiar with phylogenetic trees, but here is an example:

   /-YP_001604167.1
  |
  |--YP_001604351.1
--|
  |      /-seq_TAG2_Canis_taurus
  |   /-|
  |  |   \-seq_TAG2_Canis_austracus
   \-|
     |   /-YP_001798528.1
      \-|
        |   /-YP_009173671.1
         \-|
           |   /-seq_TAG1_Mus_musculus
            \-|
              |   /-seq_TAG1_Mus_griseus
               \-|
                 |   /-seq_TAG2_Canis_canis
                  \-|
                    |   /-seq_TAG2_Canis_familiaris
                     \-|
                        \-seq_TAG2_Canis_lupus

This tree is encoded in a specific format called Newick:

'(YP_001604167.1,YP_001604351.1,((seq_TAG2_Canis_austracus,seq_TAG2_Canis_taurus),(YP_001798528.1,(YP_009173671.1,(seq_TAG1_Mus_musculus,(seq_TAG1_Mus_griseus,(seq_TAG2_Canis_lupus,(seq_TAG2_Canis_familiaris,seq_TAG2_Canis_canis))))))));'
  • Explanation of the format:

The tree ends with a semicolon. The bottommost node in this tree is an interior node, not a tip. Interior nodes are represented by a pair of matched parentheses. Between them are representations of the nodes (seq_names) that are immediately descended from that node, separated by commas.

So if I have something like:

(A,(B,C)); 

Then it means that B and C are more closely related to each other, and A is the most distant.
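As a minimal sketch of how this nesting maps onto a data structure, the parser below turns a Newick string into nested Python tuples, with a tuple for each interior node and a string for each tip. The name `parse_newick` is made up for illustration, and the sketch assumes leaf names only (no branch lengths, internal labels, or quoted names), which is all that the strings in this question use.

```python
def parse_newick(text):
    """Parse a leaf-names-only Newick string into nested tuples.

    Simplifying assumption: no branch lengths, internal node labels,
    quoted names or comments.
    """
    # Unquoted Newick names cannot contain whitespace, so drop all of it.
    text = "".join(text.split()).rstrip(";")
    pos = 0

    def peek():
        # Current character, or "" once the end of the text is reached.
        return text[pos] if pos < len(text) else ""

    def node():
        nonlocal pos
        if peek() == "(":
            pos += 1  # consume "("
            children = [node()]
            while peek() == ",":
                pos += 1  # consume ","
                children.append(node())
            if peek() != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
            return tuple(children)
        # A leaf name runs until the next structural character.
        start = pos
        while peek() and peek() not in "(),":
            pos += 1
        return text[start:pos]

    tree = node()
    if pos != len(text):
        raise ValueError(f"unexpected text at position {pos}")
    return tree


tree = parse_newick("(A,(B,C));")
print(tree)  # ('A', ('B', 'C'))
```

B and C end up in the same inner tuple, which mirrors the statement that they are closer to each other than either is to A.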

The idea of my question is to find a way, using Python for instance, to count the number of groups with the same "TAG_number" whose members are closer to each other than to any other TAG_number or YP_number node.

For instance, TAG2 is represented in 2 groups: (seq_TAG2_Canis_taurus, seq_TAG2_Canis_austracus) are together, and the second group (seq_TAG2_Canis_canis, (seq_TAG2_Canis_familiaris, seq_TAG2_Canis_lupus)) are together. For TAG1, as you can see, no sequences are nested together, because seq_TAG1_Mus_griseus is closer to the group (seq_TAG2_Canis_canis, (seq_TAG2_Canis_familiaris, seq_TAG2_Canis_lupus)) than it is to the other TAG1 sequence, seq_TAG1_Mus_musculus.

So the result should be something like:

groups for TAG_1 : 0 
groups for TAG_2 : 2 

I know that some packages in Python or R can tell whether TAG_number sequences form "monophyletic groups", but none of them reports the number of groups when the TAG_number sequences are split across the tree.

Do you have any idea how to do that? Thank you very much.

Other part of the question:

Now I have a species phylogeny such as:

|         /-Canis_taurus
|      /-|
|     |   \-Canis_austracus
|   /-|
|  |  |   /-Canis_africus
|  |   \-|
|  |     |   /-Canis_familiaris
 \-|      \-|
   |         \-Canis_lupus
   |
   |   /-Canis_canis
    \-|
       \-Lupus_lupus

The idea is, for each monophyletic group found in the previous step, to find the MRCA (most recent common ancestor) of its species in the species phylogeny and to count the number of species in the clade formed by that MRCA.

So I have 2 groups:

The first:

#    /-TAG2, seq_TAG2_Canis_austracus
# --|
#    \-TAG2, seq_TAG2_Canis_taurus
#

Here Canis_austracus and Canis_taurus share an MRCA in the species phylogeny, and this ancestor forms a clade composed of 2 species (Canis_austracus and Canis_taurus).

So the number of species within the species phylogenetic tree = 2

The second:

#    /-TAG2, seq_TAG2_Canis_lupus
# --|
#   |   /-TAG2, seq_TAG2_Canis_familiaris
#    \-|
#       \-TAG2, seq_TAG2_Canis_canis

Here the 3 taxa share an MRCA, and this ancestor forms the clade composed of all the species in the species phylogeny (7).

So the number of species within the species phylogenetic tree = 7
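The two counts above can be reproduced with a small dependency-free sketch. It is only an illustration under stated assumptions: the species tree is transcribed by hand into nested tuples (assuming the drawing shows the whole tree), and the names `leaves`, `mrca_clade`, and `species_of` are hypothetical helpers, not part of any library.

```python
# Species tree transcribed by hand from the drawing, as nested tuples:
# a tuple is an interior node, a string is a leaf.
# Spelling assumption: leaf names are normalized to match the gene-tree
# names (Canis_austracus, Canis_lupus).
SPECIES_TREE = (
    (
        ("Canis_taurus", "Canis_austracus"),
        ("Canis_africus", ("Canis_familiaris", "Canis_lupus")),
    ),
    ("Canis_canis", "Lupus_lupus"),
)


def leaves(node):
    """Return the set of leaf names below a node."""
    if isinstance(node, str):
        return {node}
    return set().union(*(leaves(child) for child in node))


def mrca_clade(node, targets):
    """Return the smallest clade of `node` that contains every target leaf."""
    targets = set(targets)
    if not targets <= leaves(node):
        raise ValueError("some target species are missing from the tree")
    while not isinstance(node, str):
        # Descend while a single child still holds all the targets.
        for child in node:
            if targets <= leaves(child):
                node = child
                break
        else:
            break  # targets are split across children: this is the MRCA
    return node


def species_of(seq_name):
    """Hypothetical helper: 'seq_TAG2_Canis_lupus' -> 'Canis_lupus'."""
    return seq_name.split("_", 2)[2]


groups = [
    ["seq_TAG2_Canis_austracus", "seq_TAG2_Canis_taurus"],
    ["seq_TAG2_Canis_lupus", "seq_TAG2_Canis_familiaris", "seq_TAG2_Canis_canis"],
]
sizes = [len(leaves(mrca_clade(SPECIES_TREE, map(species_of, g)))) for g in groups]
print(sizes)  # [2, 7]
```

Recomputing `leaves` at every step is quadratic in the worst case, which is fine for a tree of this size; a larger tree would probably call for caching the leaf sets.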


Solution

  • Maybe get_monophyletic of ete3 is what you need? http://etetoolkit.org/docs/latest/reference/reference_tree.html?highlight=get_monophyletic#ete3.TreeNode.get_monophyletic

    import re

    from ete3 import Tree

    # build tree
    t = Tree("(YP_001604167.1,YP_001604351.1,"
             "((seq_TAG2_Canis_austracus,seq_TAG2_Canis_taurus),"
             "(YP_001798528.1,(YP_009173671.1,(seq_TAG1_Mus_musculus,"
             "(seq_TAG1_Mus_griseus,(seq_TAG2_Canis_lupus,"
             "(seq_TAG2_Canis_familiaris,seq_TAG2_Canis_canis))))))));")
    
    # set the tag as a leaf attribute
    for leaf in t:
        # get tag from name
        tag = re.search('TAG[0-9]', leaf.name)
        tag = tag.group(0) if tag else None
        leaf.add_features(tag=tag)
    
    # show the whole tree
    print(t.get_ascii(attributes=["name", "tag"], show_internal=False))
    
    # show all monophyletic groups for tag=TAG2
    for node in t.get_monophyletic(values=["TAG2"], target_attr="tag"):
        print(node.get_ascii(attributes=["tag", "name"], show_internal=False))
    
    
    #    /-TAG2, seq_TAG2_Canis_austracus
    # --|
    #    \-TAG2, seq_TAG2_Canis_taurus
    #
    #    /-TAG2, seq_TAG2_Canis_lupus
    # --|
    #   |   /-TAG2, seq_TAG2_Canis_familiaris
    #    \-|
    #       \-TAG2, seq_TAG2_Canis_canis
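The snippet above prints the groups; the question also asks for the number of groups per tag. As a hedged, dependency-free sketch of that counting step (not the ete3 API), the code below walks the same tree written by hand as nested tuples and counts, for each tag, the maximal clades that contain only that tag. It assumes that a "group" needs at least two leaves, which is what makes TAG1 come out as 0; `leaf_tag` and `count_groups` are made-up names.

```python
import re
from collections import Counter

# The gene tree from the question, transcribed by hand as nested tuples:
# a tuple is an interior node, a string is a leaf.
TREE = (
    "YP_001604167.1",
    "YP_001604351.1",
    (
        ("seq_TAG2_Canis_austracus", "seq_TAG2_Canis_taurus"),
        (
            "YP_001798528.1",
            (
                "YP_009173671.1",
                (
                    "seq_TAG1_Mus_musculus",
                    (
                        "seq_TAG1_Mus_griseus",
                        (
                            "seq_TAG2_Canis_lupus",
                            ("seq_TAG2_Canis_familiaris", "seq_TAG2_Canis_canis"),
                        ),
                    ),
                ),
            ),
        ),
    ),
)


def leaf_tag(name):
    """Return 'TAG<n>' if the leaf name contains one, else the name itself.

    Untagged leaves (YP_...) keep their own unique name, so they can never
    form a pure group with another leaf.
    """
    match = re.search(r"TAG[0-9]+", name)
    return match.group(0) if match else name


def count_groups(tree):
    """Count, per tag, the maximal clades (>= 2 leaves) made of that tag only."""
    counts = Counter()

    def visit(node):
        # Returns (set of tags below node, number of leaves below node).
        if isinstance(node, str):
            return {leaf_tag(node)}, 1
        results = [visit(child) for child in node]
        tags = set().union(*(child_tags for child_tags, _ in results))
        if len(tags) > 1:
            # Mixed clade: every pure child with >= 2 leaves is maximal.
            for child_tags, size in results:
                if len(child_tags) == 1 and size >= 2:
                    counts[next(iter(child_tags))] += 1
        return tags, sum(size for _, size in results)

    tags, size = visit(tree)
    if len(tags) == 1 and size >= 2:  # the whole tree is one single group
        counts[next(iter(tags))] += 1
    return counts


counts = count_groups(TREE)
for tag in ("TAG1", "TAG2"):
    print(f"groups for {tag}: {counts[tag]}")
# groups for TAG1: 0
# groups for TAG2: 2
```

A pure clade is counted only when its parent is mixed, so nested pure clades such as (familiaris, canis) inside the three-leaf TAG2 clade are not counted twice.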