Import XML into Orange3 using Python script


I have an XML document on my computer that looks something like this:

<?xml version="1.0" encoding="UTF-8"?>
<IPDatas xmlns:xsi="http://www.w3.org/...">
   <datas>
      <dna>
         <profile>
            <loci>
               <locus name="one">
                  <allele order="1">10</allele>
                  <allele order="2">12.3</allele>
               </locus>
               <locus name="two">
                  <allele order="1">11.1</allele>
                  <allele order="2">17</allele>
               </locus>
               <locus name="three">
                  <allele order="1">13.2</allele>
                  <allele order="2">12.3</allele>
               </locus>
            </loci>
         </profile>
      </dna>
   </datas>
</IPDatas>

I want to import the document into Orange without first converting it outside Orange, so I probably need to use the "Python script" widget. After importing, I want to convert it into a table like this:

one_1 one_2 two_1 two_2 three_1 three_2
10 12.3 11.1 17 13.2 12.3

My knowledge of Python is poor, so any advice will be highly appreciated!


Solution

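  • Something like the code below, which parses the XML with ElementTree and stores each allele value under a key of the form <locus name>_<allele order>: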

    import xml.etree.ElementTree as ET
    import pprint
    
    
    xml = '''
    <IPDatas xmlns:xsi="http://www.w3.org/...">
       <datas>
          <dna>
             <profile>
                <loci>
                   <locus name="one">
                      <allele order="1">10</allele>
                      <allele order="2">12.3</allele>
                   </locus>
                   <locus name="two">
                      <allele order="1">11.1</allele>
                      <allele order="2">17</allele>
                   </locus>
                   <locus name="three">
                      <allele order="1">13.2</allele>
                      <allele order="2">12.3</allele>
                   </locus>
                </loci>
             </profile>
          </dna>
       </datas>
    </IPDatas> '''
    
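    # Maps "<locus name>_<allele order>" to the allele value as a float
    data = {}
    root = ET.fromstring(xml)
    # './/locus' matches every <locus> element at any depth below the root
    locus_lst = root.findall('.//locus')
    for locus in locus_lst:
        name = locus.attrib['name']
        # Only the <allele> elements that are direct children of this locus
        allele_lst = locus.findall('allele')
        for allele in allele_lst:
            final_name = f"{name}_{allele.attrib['order']}"
            value = float(allele.text)
            data[final_name] = value
    pprint.pprint(data)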
    

    Output (a dict that you should be able to use with Orange; pprint sorts the keys alphabetically, so the order differs from the XML):

    {'one_1': 10.0,
     'one_2': 12.3,
     'three_1': 13.2,
     'three_2': 12.3,
     'two_1': 11.1,
     'two_2': 17.0}
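
    The question mentions a file on disk and a fixed column order, so here is a rough sketch of how the same parsing could read from a file and keep the columns in document order. The names read_loci, to_orange_table and profile.xml are made up for illustration. The Orange part assumes Orange3 and NumPy are installed and that the code runs in the Python Script widget, where assigning to out_data sends the table onward. I have left that last step commented out so the rest runs without Orange.

```python
import io
import xml.etree.ElementTree as ET


def read_loci(source):
    """Return (column_names, values) in document order.

    `source` is a file path or an open file object, as accepted by ET.parse.
    """
    root = ET.parse(source).getroot()
    names, values = [], []
    for locus in root.iter('locus'):
        for allele in locus.findall('allele'):
            names.append(f"{locus.get('name')}_{allele.get('order')}")
            values.append(float(allele.text))
    return names, values


def to_orange_table(names, values):
    """Build a one-row Orange table (assumes Orange3 and NumPy are installed)."""
    import numpy as np
    from Orange.data import ContinuousVariable, Domain, Table

    # One continuous (numeric) column per locus/allele pair
    domain = Domain([ContinuousVariable(name) for name in names])
    return Table.from_numpy(domain, np.array([values], dtype=float))


# Stand-in for the real file so the sketch runs anywhere; in practice pass
# a path instead, e.g. read_loci("profile.xml") (hypothetical file name).
sample = io.StringIO("""
<IPDatas>
   <datas><dna><profile><loci>
      <locus name="one">
         <allele order="1">10</allele>
         <allele order="2">12.3</allele>
      </locus>
      <locus name="two">
         <allele order="1">11.1</allele>
         <allele order="2">17</allele>
      </locus>
      <locus name="three">
         <allele order="1">13.2</allele>
         <allele order="2">12.3</allele>
      </locus>
   </loci></profile></dna></datas>
</IPDatas>
""")

names, values = read_loci(sample)
print(names)
print(values)

# Inside the Python Script widget, the table would be sent on with:
# out_data = to_orange_table(names, values)
```

    Two parallel lists are used instead of a dict only to make the column order explicit; a plain dict also keeps insertion order on Python 3.7+, so data.keys() and data.values() from the code above would work the same way.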