Search code examples
pythonpython-3.xxml-parsinglxmllxml.objectify

Issues Parsing Multi Nested Childs in XML using lxml


Im having issues parsing out each child node within an xml file. The number of nodes can change per Instrument_Root. For instance, Instrument_Watch is NULL here, but will be populated in other instances after this. My goal is to have each child node parsed individually (Instrument_Ratings, Instrument_Attribute_Ratings, Instrument_Organization, Instrument_Supports, etc.)

I tried doing the following, but it just returned the first intance repeatedly - there are 3700 Instrument_Root in the file, and Instrument_Rating for this one Instrument_Root was repeated 3700 times. I also ran into errors with etree due to the namespace.

from lxml import objectify

xml = objectify.parse(file)
root = xml.getroot()

tree1 = []
tree2 = []
tree3 = []
tree4 = []
for children in range(len(root.getchildren())):
    tree1.append([child.text for child in root.getchildren()[children].iterchildren()])
    for children2 in root.Instrument_Root.Instrument_Ratings.Instrument_Rating.getchildren():
        tree2.append([child2.text for child2 in  root.Instrument_Root.Instrument_Ratings.Instrument_Rating.getchildren()])
        for children3 in root.Instrument_Root.Instrument_Ratings.Instrument_Rating.Instrument_Rating_Attributes.Instrument_Rating_Attribute.getchildren():
            tree3.append([child3.text for child3 in root.Instrument_Root.Instrument_Ratings.Instrument_Rating.Instrument_Rating_Attributes.Instrument_Rating_Attribute.getchildren()])
            for children4 in root.Instrument_Root.Instrument_Organizations.Instrument_Organization.getchildren():
                tree4.append([child4.text for child4 in root.Instrument_Root.Instrument_Organizations.Instrument_Organization.getchildren()])

XML:

<?xml version="1.0" encoding="utf-8"?>              
<Instrument_Roots xmlns="http://www.XXXXX.com" xmlns:xsi="http://www.XXXXXXX.XMLSchema-instance" file_type="Baseline" frequency="Hourly-12" generation_time="2020-04-06T12:00:00Z">             
    <Instrument_Root>           
        <Instrument_ID>831295951</Instrument_ID>        
        <Deal_number>831275547</Deal_number>        
        <Class_Code>18705</Class_Code>      
        <Class_Text>Pass-Through</Class_Text>       
        <Class_Short_Description>PAS</Class_Short_Description>      
        <Dated_Date>2020-03-21T00:00:00</Dated_Date>        
        <ISO_Currency_Code>AUD</ISO_Currency_Code>      
        <Currency_Multiple_Indicator>N</Currency_Multiple_Indicator>        
        <Maturity_Date>2051-03-21T00:00:00</Maturity_Date>      
        <Maturity_Year>2051</Maturity_Year>     
        <Sale_Date>2020-03-21T00:00:00</Sale_Date>      
        <Face_Amount_USD>7.2534316791</Face_Amount_USD>     
        <Credit_Linked_Indicator>N</Credit_Linked_Indicator>        
        <Takedown_Indicator>N</Takedown_Indicator>      
        <Security_Description>Class B</Security_Description>        
        <Instrument_Type_Code>24657</Instrument_Type_Code>      
        <Instrument_Type_Text>PASS-THRU CTFS</Instrument_Type_Text>     
        <Private_Placement_Code>24922</Private_Placement_Code>      
        <Private_Placement_Text>Not Applicable</Private_Placement_Text>     
        <Coupon_Type_Code>26</Coupon_Type_Code>     
        <Coupon_Type_Text>Floating</Coupon_Type_Text>       
        <Coupon_Type_Short_Description>FLT</Coupon_Type_Short_Description>      
        <Coupon_Frequency_Code>16</Coupon_Frequency_Code>       
        <Coupon_Frequency_Text>Monthly</Coupon_Frequency_Text>      
        <Coupon_Frequency_Short_Description>MON</Coupon_Frequency_Short_Description>        
        <Coupon_Rate xsi:nil="true"/>       
        <Instrument_Description xsi:nil="true"/>        
        <Product_Line_Description>MBS - Prime</Product_Line_Description>        
        <Series_Class_Text>Class B</Series_Class_Text>      
        <Market_Description>AUSTRALIA</Market_Description>      
        <Face_Amount>11.2500000000</Face_Amount>        
        <Structured_Finance_Indicator>Y</Structured_Finance_Indicator>      
        <Structured_Finance_Sequence_Number>3</Structured_Finance_Sequence_Number>      
        <Instrument_Ratings>        
            <Instrument_Rating> 
                <Instrument_ID>831295951</Instrument_ID>
                <Rating_Class_Number>37203</Rating_Class_Number>
                <Rating_Date>2020-03-02T01:30:03</Rating_Date>
                <XXXX_Rating_ID>831295958</XXXX_Rating_ID>
                <Rating_Level>I</Rating_Level>
                <Rating_Class_Text>Senior Secured</Rating_Class_Text>
                <Security_Class_Code>18705</Security_Class_Code>
                <Security_Class_Text>Pass-Through</Security_Class_Text>
                <Security_Class_Short_Description>PAS</Security_Class_Short_Description>
                <Duration_Code>25636</Duration_Code>
                <Duration_Text>Long-Term Debt Rating</Duration_Text>
                <Duration_Short_Description>LT</Duration_Short_Description>
                <Seniority_Code>18743</Seniority_Code>
                <Seniority_Text>Senior Secured</Seniority_Text>
                <Seniority_Short_Description>SS</Seniority_Short_Description>
                <Evaluation_Type_Code>25648</Evaluation_Type_Code>
                <Evaluation_Type_Text>Credit Risk</Evaluation_Type_Text>
                <Shadow_Code>5734</Shadow_Code>
                <Shadow_Text>Enhanced</Shadow_Text>
                <Shadow_Short_Description>ENH</Shadow_Short_Description>
                <Rating_Subclass_Code xsi:nil="true"/>
                <Rating_Subclass_Text xsi:nil="true"/>
                <Currency_Capd_Code>19142</Currency_Capd_Code>
                <Currency_Capd_Text>Local Currency</Currency_Capd_Text>
                <Rating_Text>NR</Rating_Text>
                <Credit_Grade xsi:nil="true"/>
                <Rating_Rank>0</Rating_Rank>
                <Rating_Direction_Code>19102</Rating_Direction_Code>
                <Rating_Direction_Text>DECISION NOT TO RATE</Rating_Direction_Text>
                <Rating_Direction_Short_Description>NR</Rating_Direction_Short_Description>
                <Rating_Type_Code>534</Rating_Type_Code>
                <Rating_Type_Text>Long-Term Debt Rating</Rating_Type_Text>
                <Rating_Type_Short_Description>LT</Rating_Type_Short_Description>
                <Rating_Enhancement_Level>ENH</Rating_Enhancement_Level>
                <Rating_Local_Date>2020-03-02T17:30:03</Rating_Local_Date>
                <Rating_Termination_Date xsi:nil="true"/>
                <Rating_Termination_Local_Date xsi:nil="true"/>
                <Rating_Reason_Code>25530</Rating_Reason_Code>
                <Rating_Reason_Text>DECISION NOT TO RATE</Rating_Reason_Text>
                <Rating_Currency_Code>20525</Rating_Currency_Code>
                <Rating_Currency_Text>Australian Dollar</Rating_Currency_Text>
                <Rating_Currency_ISO_Code>AUD</Rating_Currency_ISO_Code>
                <Rating_Monitor_Indicator>1</Rating_Monitor_Indicator>
                <Initial_Rating_Indicator>Y</Initial_Rating_Indicator>
                <Instrument_Watchlist xsi:nil="true"/>
                <Instrument_Rating_Attributes>
                </Instrument_Rating_Attributes>
            </Instrument_Rating>    
        </Instrument_Ratings>       
        <Instrument_Supports xsi:nil="true"/>       
        <Instrument_Organizations>      
            <Instrument_Organization>   
                <Organization_ID>88100</Organization_ID>
                <Instrument_ID>831295951</Instrument_ID>
                <Organization_Role_Code>5156484</Organization_Role_Code>
                <Organization_Role_Text>Issuer Account Bank</Organization_Role_Text>
                <Effective_Date>2020-03-02T01:30:03</Effective_Date>
                <Termination_Date xsi:nil="true"/>
            </Instrument_Organization>  
            <Instrument_Organization>   
                <Organization_ID>435700</Organization_ID>
                <Instrument_ID>831295951</Instrument_ID>
                <Organization_Role_Code>141</Organization_Role_Code>
                <Organization_Role_Text>Underwriter</Organization_Role_Text>
                <Effective_Date>2020-03-02T01:30:03</Effective_Date>
                <Termination_Date xsi:nil="true"/>
            </Instrument_Organization>  
            <Instrument_Organization>   
                <Organization_ID>600045234</Organization_ID>
                <Instrument_ID>831295951</Instrument_ID>
                <Organization_Role_Code>25017</Organization_Role_Code>
                <Organization_Role_Text>Seller</Organization_Role_Text>
                <Effective_Date>2020-03-02T01:30:03</Effective_Date>
                <Termination_Date xsi:nil="true"/>
            </Instrument_Organization>  
            <Instrument_Organization>   
                <Organization_ID>600045234</Organization_ID>
                <Instrument_ID>831295951</Instrument_ID>
                <Organization_Role_Code>25020</Organization_Role_Code>
                <Organization_Role_Text>Servicer</Organization_Role_Text>
                <Effective_Date>2020-03-02T01:30:03</Effective_Date>
                <Termination_Date xsi:nil="true"/>
            </Instrument_Organization>  
            <Instrument_Organization>   
                <Organization_ID>530750</Organization_ID>
                <Instrument_ID>831295951</Instrument_ID>
                <Organization_Role_Code>24998</Organization_Role_Code>
                <Organization_Role_Text>Arranger</Organization_Role_Text>
                <Effective_Date>2020-03-02T01:30:03</Effective_Date>
                <Termination_Date xsi:nil="true"/>
            </Instrument_Organization>  
            <Instrument_Organization>   
                <Organization_ID>540500</Organization_ID>
                <Instrument_ID>831295951</Instrument_ID>
                <Organization_Role_Code>141</Organization_Role_Code>
                <Organization_Role_Text>Underwriter</Organization_Role_Text>
                <Effective_Date>2020-03-02T01:30:03</Effective_Date>
                <Termination_Date xsi:nil="true"/>
            </Instrument_Organization>  
            <Instrument_Organization>   
                <Organization_ID>714325</Organization_ID>
                <Instrument_ID>831295951</Instrument_ID>
                <Organization_Role_Code>141</Organization_Role_Code>
                <Organization_Role_Text>Underwriter</Organization_Role_Text>
                <Effective_Date>2020-03-02T01:30:03</Effective_Date>
                <Termination_Date xsi:nil="true"/>
            </Instrument_Organization>  
            <Instrument_Organization>   
                <Organization_ID>192835</Organization_ID>
                <Instrument_ID>831295951</Instrument_ID>
                <Organization_Role_Code>141</Organization_Role_Code>
                <Organization_Role_Text>Underwriter</Organization_Role_Text>
                <Effective_Date>2020-03-02T01:30:03</Effective_Date>
                <Termination_Date xsi:nil="true"/>
            </Instrument_Organization>  
            <Instrument_Organization>   
                <Organization_ID>600020914</Organization_ID>
                <Instrument_ID>831295951</Instrument_ID>
                <Organization_Role_Code>141</Organization_Role_Code>
                <Organization_Role_Text>Underwriter</Organization_Role_Text>
                <Effective_Date>2020-03-02T01:30:03</Effective_Date>
                <Termination_Date xsi:nil="true"/>
            </Instrument_Organization>  
            <Instrument_Organization>   
                <Organization_ID>831295948</Organization_ID>
                <Instrument_ID>831295951</Instrument_ID>
                <Organization_Role_Code>129</Organization_Role_Code>
                <Organization_Role_Text>Issuer</Organization_Role_Text>
                <Effective_Date>2020-03-02T01:30:03</Effective_Date>
                <Termination_Date xsi:nil="true"/>
            </Instrument_Organization>  
            <Instrument_Organization>   
                <Organization_ID>600018753</Organization_ID>
                <Instrument_ID>831295951</Instrument_ID>
                <Organization_Role_Code>140</Organization_Role_Code>
                <Organization_Role_Text>Trustee</Organization_Role_Text>
                <Effective_Date>2020-03-02T01:30:03</Effective_Date>
                <Termination_Date xsi:nil="true"/>
            </Instrument_Organization>  
            <Instrument_Organization>   
                <Organization_ID>600045234</Organization_ID>
                <Instrument_ID>831295951</Instrument_ID>
                <Organization_Role_Code>25002</Organization_Role_Code>
                <Organization_Role_Text>Custodian</Organization_Role_Text>
                <Effective_Date>2020-03-02T01:30:03</Effective_Date>
                <Termination_Date xsi:nil="true"/>
            </Instrument_Organization>  
            <Instrument_Organization>   
                <Organization_ID>806942842</Organization_ID>
                <Instrument_ID>831295951</Instrument_ID>
                <Organization_Role_Code>140</Organization_Role_Code>
                <Organization_Role_Text>Trustee</Organization_Role_Text>
                <Effective_Date>2020-03-02T01:30:03</Effective_Date>
                <Termination_Date xsi:nil="true"/>
            </Instrument_Organization>  
            <Instrument_Organization>   
                <Organization_ID>820688318</Organization_ID>
                <Instrument_ID>831295951</Instrument_ID>
                <Organization_Role_Code>5155837</Organization_Role_Code>
                <Organization_Role_Text>Cash Manager</Organization_Role_Text>
                <Effective_Date>2020-03-02T01:30:03</Effective_Date>
                <Termination_Date xsi:nil="true"/>
            </Instrument_Organization>  
            <Instrument_Organization>   
                <Organization_ID>530750</Organization_ID>
                <Instrument_ID>831295951</Instrument_ID>
                <Organization_Role_Code>141</Organization_Role_Code>
                <Organization_Role_Text>Underwriter</Organization_Role_Text>
                <Effective_Date>2020-03-02T01:30:03</Effective_Date>
                <Termination_Date xsi:nil="true"/>
            </Instrument_Organization>  
            <Instrument_Organization>   
                <Organization_ID>600045234</Organization_ID>
                <Instrument_ID>831295951</Instrument_ID>
                <Organization_Role_Code>25012</Organization_Role_Code>
                <Organization_Role_Text>Originator</Organization_Role_Text>
                <Effective_Date>2020-03-02T01:30:03</Effective_Date>
                <Termination_Date xsi:nil="true"/>
            </Instrument_Organization>  
        </Instrument_Organizations>     
        <Instrument_Identifiers xsi:nil="true"/>        
        <Instrument_Attributes>     
            <Instrument_Attribute>  
                <Instrument_ID>831295951</Instrument_ID>
                <Rating_Attribute_Type_Code>5156438</Rating_Attribute_Type_Code>
                <Rating_Attribute_Type_Text>SF Indicator</Rating_Attribute_Type_Text>
                <Effective_Date>2020-03-02T01:30:03</Effective_Date>
                <Termination_Date xsi:nil="true"/>
                <Rating_Attribute_Code>5156439</Rating_Attribute_Code>
                <Rating_Attribute_Text>(sf)</Rating_Attribute_Text>
            </Instrument_Attribute> 
            <Instrument_Attribute>  
                <Instrument_ID>831295951</Instrument_ID>
                <Rating_Attribute_Type_Code>5168187</Rating_Attribute_Type_Code>
                <Rating_Attribute_Type_Text>SEC Exempt</Rating_Attribute_Type_Text>
                <Effective_Date>2020-03-02T01:30:03</Effective_Date>
                <Termination_Date xsi:nil="true"/>
                <Rating_Attribute_Code>5168185</Rating_Attribute_Code>
                <Rating_Attribute_Text>Exempt</Rating_Attribute_Text>
            </Instrument_Attribute> 
        </Instrument_Attributes>        
        <Instrument_Markets>        
            <Instrument_Market> 
                <Instrument_ID>831295951</Instrument_ID>
                <Domain_Number>75300</Domain_Number>
                <Effective_Date>2020-03-21T00:00:00</Effective_Date>
                <Termination_Date xsi:nil="true"/>
            </Instrument_Market>    
        </Instrument_Markets>       
    </Instrument_Root>          
 </Instrument_Roots>            

Any ideas on how to attack this would be greatly appreciated. Thanks.


Solution

  • The source of your problem is that your XML has a default namespace (http://www.XXXXX.com), so each attempt to locate an element must include this namespace (your code failed on this detail).

    To process your XML file I used the following code:

    1. Import:

      from lxml import etree as et
      
    2. Read the XML file:

      parser = et.XMLParser(remove_blank_text=True)
      tree = et.parse('Instrum.xml', parser)
      root = tree.getroot()
      
    3. Define the namespace used:

      ns = {'xx': 'http://www.XXXXX.com'}
      

      (will be used below).

    4. Fill tree1 with text content of children of each Instrument_Root:

      tree1 = []
      for elem in root.findall('xx:Instrument_Root/*', ns):
          txt = elem.text
          if txt is not None:
              tree1.append(txt)
      

      Note that Instrument_Root is a direct descendant of the root node, so it is enough to put just the node name.

    5. Fill tree2 with text content of children of each Instrument_Rating:

      tree2 = []
      for elem in root.findall('.//xx:Instrument_Rating/*', ns):
          txt = elem.text
          if txt is not None and len(txt.strip()) > 0:
              tree2.append(txt)
      

      This time Instrument_Rating is located somewhere deeper in the XML tree, so XPath must include // to perform "all levels" search.

      I added also some logic to avoid appending either non-existing text or text containing only "while" chars (delete it if you don't want to skip them).

    For your XML input sample I got:

    1. tree1:

      ['831295951', '831275547', '18705', 'Pass-Through', 'PAS', '2020-03-21T00:00:00',
       'AUD', 'N', '2051-03-21T00:00:00', '2051', '2020-03-21T00:00:00', '7.2534316791',
       'N', 'N', 'Class B', '24657', 'PASS-THRU CTFS', '24922', 'Not Applicable',
       '26', 'Floating', 'FLT', '16', 'Monthly', 'MON', 'MBS - Prime', 'Class B',
       'AUSTRALIA', '11.2500000000', 'Y', '3']
      
    2. tree2:

      ['831295951', '37203', '2020-03-02T01:30:03', '831295958', 'I', 'Senior Secured',
       '18705', 'Pass-Through', 'PAS', '25636', 'Long-Term Debt Rating', 'LT',
       '18743', 'Senior Secured', 'SS', '25648', 'Credit Risk', '5734', 'Enhanced',
       'ENH', '19142', 'Local Currency', 'NR', '0', '19102', 'DECISION NOT TO RATE',
       'NR', '534', 'Long-Term Debt Rating', 'LT', 'ENH', '2020-03-02T17:30:03', '25530',
       'DECISION NOT TO RATE', '20525', 'Australian Dollar', 'AUD', '1', 'Y']
      

    Note that there is no need for any nested loops.

    I think that based on the code above you will know how to extract content to fill tree3 and tree4.