python xml xml-parsing annotations brat

How to convert a txt.knowtator.xml file to .ann?


I have an annotated dataset in txt.knowtator.xml format:

<?xml version="1.0" encoding="UTF-8"?>
<annotations textSource="file.txt">
    <annotation>
        <mention id="EHOST_Instance_93" />
        <annotator id="01">Unknown</annotator>
        <span start="127" end="237" />
        <spannedText>Omeprazole</spannedText>
        <creationDate>Wed Mar 11 09:52:01 GMT 2010</creationDate>
    </annotation>
    <classMention id="EHOST_Instance_93">
        <mentionClass id="Treatment">Omeprazole</mentionClass>
    </classMention>
    <annotation>
        <mention id="EHOST_Instance_94" />
        <annotator id="01">Unknown</annotator>
        <span start="600" end="612" />
        <spannedText>Tegretol</spannedText>
        <creationDate>Wed Mar 11 09:55:11 GMT 2010</creationDate>
    </annotation>
    <classMention id="EHOST_Instance_94">
        <mentionClass id="Treatment">Tegretol</mentionClass>
    </classMention>
</annotations>

I need to convert it to BRAT standoff format (.ann), for example:

T1    Treatment 127 137    Omeprazole
T2    Treatment 600 612    Tegretol

Is there an existing tool for converting or parsing this format?


Solution

  • Python's standard-library xml.etree.ElementTree can parse this format; see below

    import xml.etree.ElementTree as ET
    
    xml = '''<?xml version="1.0" encoding="UTF-8"?>
    <annotations textSource="file.txt">
        <annotation>
            <mention id="EHOST_Instance_93" />
            <annotator id="01">Unknown</annotator>
            <span start="127" end="237" />
            <spannedText>Omeprazole</spannedText>
            <creationDate>Wed Mar 11 09:52:01 GMT 2010</creationDate>
        </annotation>
        <classMention id="EHOST_Instance_93">
            <mentionClass id="Treatment">Omeprazole</mentionClass>
        </classMention>
    </annotations>'''
    
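    root = ET.fromstring(xml)
    # Read the first <span>, its class label and its text (single-annotation case)
    span = root.find(".//span").attrib
    label = root.find(".//mentionClass").attrib["id"]  # e.g. "Treatment"
    text = root.find(".//spannedText").text
    print(f'T1    {label} {span["start"]} {span["end"]} {text}')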
    

    output

    T1    Treatment 127 237 Omeprazole
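
The snippet above only reads the first annotation. To cover every annotation in a file, one possible sketch is shown here. It pairs each `<annotation>` with its `<classMention>` through the shared mention id and separates the fields with tabs, as BRAT standoff lines do. The function name `knowtator_to_ann` and the `"Unknown"` fallback label are assumptions made for illustration, not part of any existing tool. Offsets are copied verbatim from the XML, so they come out exactly as the file states them.

```python
import xml.etree.ElementTree as ET


def knowtator_to_ann(xml_text):
    """Return BRAT text-bound annotation lines for a Knowtator XML string."""
    root = ET.fromstring(xml_text)

    # Map mention id -> class label, read from <classMention>/<mentionClass id="...">.
    labels = {}
    for class_mention in root.iter("classMention"):
        mention_class = class_mention.find("mentionClass")
        if mention_class is not None:
            labels[class_mention.get("id")] = mention_class.get("id")

    lines = []
    for annotation in root.iter("annotation"):
        mention = annotation.find("mention")
        span = annotation.find("span")
        if mention is None or span is None:
            continue  # skip incomplete records rather than guess
        # "Unknown" is a placeholder label chosen for this sketch.
        label = labels.get(mention.get("id"), "Unknown")
        text = annotation.findtext("spannedText", default="")
        # BRAT separates the ID, the "type start end" triple and the text with tabs.
        lines.append(
            f"T{len(lines) + 1}\t{label} {span.get('start')} {span.get('end')}\t{text}"
        )
    return lines


sample = """<annotations textSource="file.txt">
    <annotation>
        <mention id="EHOST_Instance_93" />
        <span start="127" end="237" />
        <spannedText>Omeprazole</spannedText>
    </annotation>
    <classMention id="EHOST_Instance_93">
        <mentionClass id="Treatment">Omeprazole</mentionClass>
    </classMention>
    <annotation>
        <mention id="EHOST_Instance_94" />
        <span start="600" end="612" />
        <spannedText>Tegretol</spannedText>
    </annotation>
    <classMention id="EHOST_Instance_94">
        <mentionClass id="Treatment">Tegretol</mentionClass>
    </classMention>
</annotations>"""

for line in knowtator_to_ann(sample):
    print(line)
```

Writing the returned lines, joined with newlines, to a file with the same base name as the text source (for example file.ann next to file.txt) should give BRAT a standoff file it can pair with the text.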