I'm trying to convert some data to a Newick format to create a tree using the NLTK tree package. However, I'm stuck trying to do this.
My data looks like:
A, A1, A1.1, A1.1.1, A1.1.2, A1.1.3, A1.1.3.1, A1.1.3.1.1, A1.1.3.1.1.1, A1.1.3.1.1.2, A1.1.3.1.1.3, A1.1.3.1.1.4, A1.1.3.1.1.4.1, A1.1.3.1.1.5, A1.1.3.2, A1.1.3.3, A1.1.4.
Thanks!
My attempt: https://www.online-python.com/EFW3Ceti8h
Which results in:
(Root ( A ( A1 ( A1.1 A1.1.1 A1.1.2 ( A1.1.3 ( A1.1.3.1 ( A1.1.3.1.1 A1.1.3.1.1.1 A1.1.3.1.1.2 A1.1.3.1.1.3 ( A1.1.3.1.1.4 ) A1.1.3.1.1.4.1 ) ) A1.1.3.1.1.5 A1.1.3.2 ) A1.1.3.3 A1.1.4 ) ) ) )
In tree structure: [image: normal print] [image: pretty print]
This does not seem entirely right. How can I fix this?
Looking at your Wikipedia reference, it seems the Newick format puts the parent node after its bracketed, comma-delimited child nodes, i.e. your tree could be described as:
(((A1.1.4,(A1.1.3.3,A1.1.3.2,((A1.1.3.1.1.5,(A1.1.3.1.1.4.1)A1.1.3.1.1.4,A1.1.3.1.1.3,A1.1.3.1.1.2,A1.1.3.1.1.1)A1.1.3.1.1)A1.1.3.1)A1.1.3,A1.1.2,A1.1.1)A1.1)A1)A;
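To make the "children first, parent last" rule concrete, here is a minimal sketch on a toy tree. The nested-dict layout and the names to_newick and toy are assumptions made up for this illustration, not part of your data or of any library:

```python
def to_newick(label, children):
    """Return the Newick fragment for one node: (child,child,...)label."""
    if not children:
        # A leaf is written as just its label.
        return label
    inner = ",".join(to_newick(name, kids) for name, kids in children.items())
    # The children go inside the brackets, the parent label comes after them.
    return "(" + inner + ")" + label


# Hypothetical toy tree: A has children B and C; C has one child D.
toy = {"B": {}, "C": {"D": {}}}
newick = to_newick("A", toy) + ";"
print(newick)  # (B,(D)C)A;
```

The string for your data follows the same pattern, only deeper.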
To make the code slightly easier, I put your nodes in a list by surrounding the values with []. Below I've formatted the list to show the structure for easier comparison to the tree.
nodes = [
    {'code': 'A', 'name': 'Entity'},
      {'code': 'A1', 'name': 'Physical Object'},
        {'code': 'A1.1', 'name': 'Organism'},
          {'code': 'A1.1.1', 'name': 'Archaeon'},
          {'code': 'A1.1.2', 'name': 'Bacterium'},
          {'code': 'A1.1.3', 'name': 'Eukaryote'},
            {'code': 'A1.1.3.1', 'name': 'Animal'},
              {'code': 'A1.1.3.1.1', 'name': 'Vertebrate'},
                {'code': 'A1.1.3.1.1.1', 'name': 'Amphibian'},
                {'code': 'A1.1.3.1.1.2', 'name': 'Bird'},
                {'code': 'A1.1.3.1.1.3', 'name': 'Fish'},
                {'code': 'A1.1.3.1.1.4', 'name': 'Mammal'},
                  {'code': 'A1.1.3.1.1.4.1', 'name': 'Human'},
                {'code': 'A1.1.3.1.1.5', 'name': 'Reptile'},
            {'code': 'A1.1.3.2', 'name': 'Fungus'},
            {'code': 'A1.1.3.3', 'name': 'Plant'},
          {'code': 'A1.1.4', 'name': 'Virus'}
]
I've written this generic function to get from a node list to a string in the format as I understand it:
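def processNodesToNewick(nodes, f_v, f_h):
    # f_v(node) returns the label, f_h(node) the height (root = 1).
    # Note: nodes is consumed from the end via pop(), so it is empty afterwards.
    c_n = nodes.pop()
    t = '('*(f_h(c_n)-1) + f_v(c_n)
    while len(nodes) > 0:
        p_n = c_n
        c_n = nodes.pop()
        if f_h(c_n) == f_h(p_n):
            # Sibling of the previous node.
            t += ',' + f_v(c_n)
        elif f_h(c_n) > f_h(p_n):
            # Deeper than the previous node: open one bracket per level.
            t += ',' + '('*(f_h(c_n)-f_h(p_n)) + f_v(c_n)
        else:
            # Shallower: close the open brackets, then write the parent label.
            t += ')'*(f_h(p_n)-f_h(c_n)) + f_v(c_n)
    return t+';'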
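As an aside, because processNodesToNewick uses pop(), the list you pass in is empty once it returns. If you want to keep the list, a non-mutating variant could look like the sketch below. The names nodes_to_newick, depth and sample are hypothetical and only used for this illustration; the logic is meant to mirror the function above, walking the list back to front with reversed():

```python
def nodes_to_newick(nodes, label, height):
    """Same idea as processNodesToNewick, but leaves the input list intact."""
    parts = []
    prev = None  # height of the previously visited node
    for node in reversed(nodes):
        h = height(node)
        if prev is None:
            # First visited node: open one bracket per ancestor level.
            parts.append("(" * (h - 1))
        elif h == prev:
            parts.append(",")
        elif h > prev:
            parts.append("," + "(" * (h - prev))
        else:
            parts.append(")" * (prev - h))
        parts.append(label(node))
        prev = h
    return "".join(parts) + ";"


def depth(code):
    # Assumes codes shaped like 'A', 'A1', 'A1.1', as in the question.
    return 1 if code == "A" else 2 + code[1:].count(".")


# Hypothetical mini data set, in the same top-down order as the real list.
sample = ["A", "A1", "A1.1", "A1.2", "A2"]
result = nodes_to_newick(sample, lambda code: code, depth)
print(result)       # (A2,(A1.2,A1.1)A1)A;
print(len(sample))  # 5, the list was not consumed
```

Alternatively, passing a copy such as list(nodes) to the original function has the same effect.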
To use processNodesToNewick, we need to define two helper functions for the exact node type and call the processing function with them:
def nodeLabel(node):
    return node['code']

def nodeHeight(node):
    code = node['code']
    if code == 'A':
        return 1
    code = code[1:]
    return 2 + code.count('.')

print(processNodesToNewick(nodes, nodeLabel, nodeHeight))
Running this prints the string at the top of the answer. Expanded a bit, it looks to match the node list above:
(((     A1.1.4,
        (       A1.1.3.3,
                A1.1.3.2,
                ((      A1.1.3.1.1.5,
                        (       A1.1.3.1.1.4.1
                        )A1.1.3.1.1.4,
                        A1.1.3.1.1.3,
                        A1.1.3.1.1.2,
                        A1.1.3.1.1.1
                )A1.1.3.1.1
                )A1.1.3.1
        )A1.1.3,
        A1.1.2,
        A1.1.1
)A1.1
)A1
)A;
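If what you ultimately need is the bracketed form from your own attempt, i.e. (parent child child ...), rather than Newick, the same depth information can be used. The following is only a sketch: nodes_to_bracketed, depth and sample are hypothetical names, the codes are assumed to be listed top-down as in your data, and I have not run the result through NLTK, although it is the kind of string nltk.Tree.fromstring is meant to parse:

```python
def nodes_to_bracketed(codes, depth):
    """Build '(parent child (parent child))' from a top-down list of codes."""
    out = []
    open_depths = []  # depths of nodes whose '(' has not been closed yet
    for i, code in enumerate(codes):
        d = depth(code)
        # Close every open node that is not an ancestor of this one.
        while open_depths and open_depths[-1] >= d:
            open_depths.pop()
            out.append(")")
        # A node gets its own bracket only if the next entry is deeper.
        has_child = i + 1 < len(codes) and depth(codes[i + 1]) > d
        if has_child:
            out.append("(" + code)
            open_depths.append(d)
        else:
            out.append(code)
    # Close whatever is still open at the end of the list.
    out.extend([")"] * len(open_depths))
    return " ".join(out).replace(" )", ")")


def depth(code):
    # Assumes codes shaped like 'A', 'A1', 'A1.1', as in the question.
    return 1 if code == "A" else 2 + code[1:].count(".")


# Hypothetical mini data set in top-down order.
sample = ["A", "A1", "A1.1", "A1.2", "A2"]
bracketed = nodes_to_bracketed(sample, depth)
print(bracketed)  # (A (A1 A1.1 A1.2) A2)
```

The difference from the output in the question is that a bracket is only closed once a node at the same or a shallower depth appears, so a child such as A1.1.3.1.1.4.1 stays inside its parent's brackets.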