How to extract lines between repetitive n number of tags from XML and continue until last tag?


I have an XML file with over 2,500 <Item> elements.

The example below shows a sample of the layout. I want to copy every line from <Item name="1st"> through <Item name="500th"> to a new file as is, then continue with the next 500 from <Item name="501st"> onwards and write those out to another new file. The result should be 5 new files, with nothing skipped.

<Item name="1st"><ItemProperties>
<property>data</property><property>data</property>
</ItemProperties>
...
...
<Item name="500th"><ItemProperties>
<property>data</property><property>data</property>
</ItemProperties>

The command below does this for the first 500 items, but I do not know how to keep going until the last closing tag.

xmllint --xpath "//Item[position()<=500]" FileName.XML > Output1.XML


Solution

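  • Using Python, the first solution reads the file one line at a time, from line 0 to the last line, and starts a new output file every 500 lines. Note that it counts lines, not <Item> elements: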

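    nfh = None  # current output file handle
    with open('foo.xml') as fh:
        num = 0
        for index, line in enumerate(fh):
            # Every 500 lines, close the current output file and open the next one
            if index % 500 == 0:
                num += 1
                if nfh:
                    nfh.close()
                nfh = open('file_name{}.txt'.format(num), 'w')
            nfh.write(line)
    # Close the last output file
    if nfh:
        nfh.close()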
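
The loop above starts a new file every 500 lines, which corresponds to 500 items only if each item occupies exactly one line. A minimal sketch of a variant that counts opening <Item> tags instead is shown here. It assumes every opening <Item> tag begins its own line, as in the sample from the question; `split_by_item` and its parameters are hypothetical names, not part of the original answer.

```python
import re

# Matches a line that opens an <Item ...> element, but not <ItemProperties>.
ITEM_START = re.compile(r'\s*<Item[\s>]')

def split_by_item(src_path, items_per_file=500, out_pattern='Output{}.XML'):
    """Copy lines unchanged, starting a new file every `items_per_file` items."""
    out = None
    written = []      # paths of the files created so far
    item_count = 0
    with open(src_path) as src:
        for line in src:
            if ITEM_START.match(line):
                if item_count % items_per_file == 0:
                    if out:
                        out.close()
                    path = out_pattern.format(len(written) + 1)
                    out = open(path, 'w')
                    written.append(path)
                item_count += 1
            # Lines before the first <Item> (such as an opening root tag)
            # are skipped; anything after the last item lands in the last file.
            if out:
                out.write(line)
    if out:
        out.close()
    return written
```

Because the lines are copied verbatim, the original formatting is preserved, but the output files are fragments rather than complete XML documents.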
    

    The second solution uses lxml to iterate over only the <Item> elements in the XML file:

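    import lxml.etree as etree

    xml_data = etree.parse('foo.xml')
    nfh = None  # current output file handle
    num = 0

    for index, tag in enumerate(xml_data.xpath('//Item')):
        # Every 500 <Item> elements, close the current file and open the next one
        if index % 500 == 0:
            num += 1
            if nfh:
                nfh.close()
            # etree.tostring() returns bytes, so the file is opened in binary mode
            nfh = open('Output{}.XML'.format(num), 'wb')
        nfh.write(etree.tostring(tag))
    if nfh:
        nfh.close()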
    

    This assumes your XML looks more like the following, with every <Item> closed and all items wrapped in a single root element:

    <root>
    <Item name="1st"><ItemProperties>
    <property>data</property><property>data</property>
    </ItemProperties>
    </Item>
    <Item name="2nd"><ItemProperties>
    <property>data</property><property>data</property>
    </ItemProperties>
    </Item>
    ....
    <Item name="500th"><ItemProperties>
    <property>data</property><property>data</property>
    </ItemProperties>
    </Item>
    ....
    </root>
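
One thing to be aware of: the lxml loop above writes bare <Item> elements one after another, so each output file has no enclosing root element and is not a well-formed XML document on its own. If that matters, a possible sketch using only the standard library's `xml.etree.ElementTree` wraps each group of items in a new element that reuses the original root tag. It assumes the layout shown above; `split_items` and its parameters are hypothetical names chosen for illustration. Since the elements are re-serialized, the output may differ from the source in whitespace and quoting.

```python
import xml.etree.ElementTree as ET

def split_items(src_path, chunk_size=500, out_pattern='Output{}.XML'):
    """Write every `chunk_size` <Item> elements to its own well-formed XML file."""
    root = ET.parse(src_path).getroot()
    items = root.findall('.//Item')   # all <Item> descendants, in document order
    paths = []
    for start in range(0, len(items), chunk_size):
        # Assumption: reusing the source root tag as the wrapper is acceptable.
        wrapper = ET.Element(root.tag)
        wrapper.extend(items[start:start + chunk_size])
        path = out_pattern.format(len(paths) + 1)
        ET.ElementTree(wrapper).write(path, encoding='utf-8', xml_declaration=True)
        paths.append(path)
    return paths
```

Calling `split_items('foo.xml')` would write Output1.XML, Output2.XML, and so on, with the last file holding whatever items remain.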