I have written code that iterates over specific rows in an XML file, but I suspect it is inefficient. My example XML file looks like this:
<data>0.0, 100.0</data>
<data>1.0, 101.0</data>
<data>2.0, 102.0</data>
<data>3.0, 103.0</data>
<data>4.0, 104.0</data>
<data>5.0, 105.0</data>
<data>6.0, 106.0</data>
<data>7.0, 107.0</data>
<data>8.0, 108.0</data>
<data>9.0, 109.0</data>
<data>10.0, 110.0</data>
In reality, I have tens of thousands of data rows and need only some of them (all I know is which rows I want, nothing else). I would like to parse and extract data from only those rows efficiently. I can do it inefficiently with nested for loops, but I doubt that is a good approach, and I haven't found another method yet. Say I want to parse and get the data from rows 4 to 8:
import lxml.etree as ET
# XPath positions of the rows I want: positions 5-9 hold the rows 4.0 to 8.0
a = [list(range(5, 10, 1))]
tree = ET.parse('x.xml')
data = []  # collected results
for x in a:  # loops over each group of row positions
    for y in x:  # loops over each row position in the group
        # XPath finds the data element at that position, one at a time
        for z in tree.xpath('//data[{}]'.format(y)):
            datat = z.text
            data.append(datat)  # store the result of each iteration
print(data)
The output then contains only:
4.0, 104.0
5.0, 105.0
6.0, 106.0
7.0, 107.0
8.0, 108.0
I've considered whether I should use the iterparse() method or something else. Liza Daly's parsing method suggests that XPath is a good way to do this, but I suspect I should reconsider my code, since so many for loops seem inefficient. Does anyone have suggestions, hints, or links for further reading on how to improve this code?
I'm not sure if it's more efficient, but you can certainly simplify your code:
dat = [your xml above]
import lxml.html  # needed here to parse from a string rather than a file
tree = lxml.html.fromstring(dat)
The simplified code:
data = []
for i in range(5, 10, 1):  # XPath counts from 1 while the rows start at 0.0, so positions 5-9 are rows 4.0-8.0
    for z in tree.xpath(f'//data[{i}]'):
        data.append(z.text)
Check that it worked:
for item in data:
    print(item)
Output:
4.0, 104.0
5.0, 105.0
6.0, 106.0
7.0, 107.0
8.0, 108.0
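If the per-row XPath calls turn out to be the bottleneck, one thing that might be worth trying is to avoid them altogether. The following is only a minimal sketch, not a benchmarked solution. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra packages, and it assumes the rows sit inside a single wrapping element, here given the hypothetical name root. Option 1 fetches all data elements once and slices the resulting list; option 2 streams the document with iterparse() and stops as soon as the last wanted row has been read. With lxml, a single expression such as //data[position() >= 5 and position() <= 9] should select the same range in one call.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: the question's rows wrapped in an assumed <root> element
# so that the text is well-formed XML.
xml_text = "<root>" + "".join(
    f"<data>{i}.0, {100 + i}.0</data>" for i in range(11)
) + "</root>"

first, last = 5, 9  # 1-based positions, matching the XPath indices used above

# Option 1: collect all <data> children once, then slice the Python list.
root = ET.fromstring(xml_text)
data = [row.text for row in root.findall("data")[first - 1:last]]

# Option 2: stream the document and stop once the last wanted row is read.
streamed = []
position = 0
for _event, elem in ET.iterparse(io.BytesIO(xml_text.encode("utf-8"))):
    if elem.tag != "data":
        continue
    position += 1
    if position >= first:
        streamed.append(elem.text)
    elem.clear()  # drop the element's content once it has been handled
    if position == last:
        break

for item in data:
    print(item)
```

Slicing is the simpler of the two and is probably enough for tens of thousands of rows. The streaming version may only pay off when the file is very large or the wanted rows are near the start, because it never builds the full list of elements.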