Tags: python, python-3.x, xpath, lxml, nested-loops

How to efficiently parse specific lines with lxml in Python (from .xml files)?


I have written code that can iterate over specific rows in an XML file, but I don't think it's good code because it's inefficient. My example XML looks like this:

<data>0.0, 100.0</data>
<data>1.0, 101.0</data>
<data>2.0, 102.0</data>
<data>3.0, 103.0</data>
<data>4.0, 104.0</data>
<data>5.0, 105.0</data>
<data>6.0, 106.0</data>
<data>7.0, 107.0</data>
<data>8.0, 108.0</data>
<data>9.0, 109.0</data>
<data>10.0, 110.0</data>

In reality, I have tens of thousands of data rows and only need some of them (basically, I only know which lines I want, nothing else). I would like to parse and get data only from specific rows efficiently. I could do that (though not efficiently) with nested for loops, but I doubt that's a good approach, and I haven't yet figured out any other method. So let's say I want to parse and get data from rows 4 to 8:

import lxml.etree as ET

# XPath positions of the rows I want (positions 5-9 hold the values 4.0 to 8.0),
# wrapped in an outer list so the nested loops below have something to iterate over
a = [list(range(5, 10, 1))]
tree = ET.parse('x.xml')
data = []  # collected results
for x in a:  # loops over the groups of row numbers
    for y in x:  # loops over the row numbers in each group
        for z in tree.xpath('//data[{}]'.format(y)):  # uses XPath to find the rows one by one by position
            datat = z.text
            data.append(datat)  # stores the result of each iteration
print(data)

The output then contains only the rows 4.0, 104.0; 5.0, 105.0; 6.0, 106.0; 7.0, 107.0; and 8.0, 108.0. I've considered whether I should use the iterparse() method or something else. Liza Daly's parsing method suggests that XPath is a good way to do this, but I suspect I should reconsider my code, since this many for loops seems inefficient. Does anyone have suggestions, hints, or links for further reading on how to improve this code?


Solution

  • I'm not sure if it's more efficient, but you can certainly simplify your code:

    import lxml.html  # used here so the sample can be parsed from a string instead of a file

    dat = """[your xml above]"""  # placeholder: paste the <data> rows from the question here
    tree = lxml.html.fromstring(dat)
    

    The simplified code:

    data = []
    for i in range(5, 10, 1):  # XPath counts from 1, while range counts from zero
        for z in tree.xpath(f'//data[{i}]'):
            data.append(z.text)
    

    Check that it worked:

    for item in data:
        print(item)
    

    Output:

    4.0, 104.0
    5.0, 105.0
    6.0, 106.0
    7.0, 107.0
    8.0, 108.0
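
If the repeated XPath lookups are the concern, one option is to avoid running a query per row at all. The sketch below is only an illustration under some assumptions: the rows are direct children of a single root element (the `<root>` wrapper here is hypothetical, since the question's sample shows no root), and the wanted rows form one contiguous range. The helper names `rows_by_slice` and `rows_by_iterparse` are made up for this example. The first helper collects the rows once and slices the list; the second streams the file with `iterparse()` and stops as soon as the last wanted row has been read.

```python
import io

try:
    import lxml.etree as ET  # the library used in the question
except ImportError:
    # Assumption: fall back to the standard library, which provides the
    # same fromstring/findall/iterparse calls used in this sketch.
    import xml.etree.ElementTree as ET

# Sample data from the question, wrapped in a hypothetical <root> element
# so that it forms a well-formed XML document.
SAMPLE = "<root>\n" + "\n".join(
    f"<data>{i}.0, {100 + i}.0</data>" for i in range(11)
) + "\n</root>"


def rows_by_slice(xml_text, first, last):
    """Return the text of <data> rows first..last (1-based, inclusive)."""
    root = ET.fromstring(xml_text)
    rows = root.findall("data")  # collect the direct <data> children once
    return [row.text for row in rows[first - 1:last]]


def rows_by_iterparse(source, first, last):
    """Stream <data> rows and stop once the last wanted row has been read."""
    wanted = []
    position = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "data":
            continue
        position += 1
        if position >= first:
            wanted.append(elem.text)
        elem.clear()  # drop the content of rows that are already processed
        if position >= last:
            break  # nothing after this row is needed
    return wanted


if __name__ == "__main__":
    print(rows_by_slice(SAMPLE, 5, 9))
    print(rows_by_iterparse(io.BytesIO(SAMPLE.encode()), 5, 9))
```

Both calls use positions 5 to 9, matching the `range(5, 10, 1)` above. For rows that are not contiguous, the range check in the streaming version could be replaced with a membership test against a set of wanted positions. Which variant is faster on tens of thousands of rows would need to be measured on the real file.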