I have an HTML file of this kind:
<html>
<head></head>
<body>
<p>
<dfn>A</dfn>sometext / ''
(<i>othertext</i>)someothertext / ''
(<i>...</i>)
(<i>...</i>)
</p>
<p>
<dfn>B</dfn>sometext / ''
(<i>othertext</i>)someothertext / ''
<i>blabla</i>
<i>bubu</i>
</p>
</body>
</html>
"sometext / ''" means that there may or may not be some text following the dfn tag; the same goes for the i tags. Also, the i tags and the text within them are not always present. Only the text inside the dfn tag is always present.
I need to get all textual information from every p-tag:
A, sometext, othertext, someothertext.
B, sometext, othertext, someothertext.
C, sometext, othertext, someothertext.
...
Z, sometext, othertext, someothertext.
The following code almost works, except that it prints the same paragraphs over and over again:
for p in tree.xpath("//p"):
    dfn = p.xpath('./dfn/text()')
    after_dfn = p.xpath("./dfn/following::text()")
    print '\n'.join(dfn), ''.join(after_dfn)
So, supposing there is one p for every letter of the alphabet, I get this kind of output:
> A, sometext, othertext, someothertext.
>
> B, sometext, othertext, someothertext.
>
> C, sometext, othertext, someothertext.
>
> ...
>
> Z, sometext, othertext, someothertext.
> (2nd unnecessary loop):
>
> B, sometext, othertext, someothertext.
>
> C, sometext, othertext, someothertext.
>
> D, sometext, othertext, someothertext.
>
> ...
>
> Z, sometext, othertext, someothertext.
> (3rd unnecessary loop):
>
> C, sometext, othertext, someothertext.
>
> D, sometext, othertext, someothertext.
>
> E, sometext, othertext, someothertext.
>
> ...
>
> Z, sometext, othertext, someothertext...etc
Strangely, it goes from the 1st p to the last one, then from the 2nd to the last one, then from the 3rd to the last one, and so on. From an initial XML file of 107 KB I end up with an enormous 26 MB of output! Please help me stop this repetition.
To get all the text below every p, just do:
tree.xpath("//p//text()")
If you need the text aggregated per p, do:
[[y.strip() for y in x.xpath('.//text()') if y.strip()] for x in tree.xpath('//p')]
To extract the text of a p based on the text of one of its i elements:
>>> [y.strip() for y in tree.xpath('//i[.="blabla"]/..//text()') if y.strip()]
['B', 'sometext', 'othertext', 'someothertext', 'blabla', 'bubu']
Or based on the dfn text:
>>> [y.strip() for y in tree.xpath('//dfn[.="B"]/..//text()') if y.strip()]
['B', 'sometext', 'othertext', 'someothertext', 'blabla', 'bubu']
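As for why the original loop repeats itself: the XPath `following::` axis selects every node that comes after the context node in the whole document, not just inside the current p. So for the first p it returns the text of all later paragraphs too, for the second p everything from the second one onward, and so on. Below is a minimal sketch of a loop that stays inside each p. The sample markup, the `rows` helper and the whitespace normalisation are assumptions made for illustration, not part of the original file.

```python
from lxml import html

# Hypothetical, simplified version of the markup from the question.
SAMPLE = """
<html><body>
<p><dfn>A</dfn>sometext (<i>othertext</i>) someothertext</p>
<p><dfn>B</dfn>sometext (<i>othertext</i>) someothertext <i>blabla</i> <i>bubu</i></p>
</body></html>
"""

tree = html.fromstring(SAMPLE)


def rows(tree):
    """Return one (dfn_text, remaining_text) pair per <p>."""
    result = []
    for p in tree.xpath("//p"):
        # Text of the <dfn> child of this <p>.
        dfn = "".join(p.xpath("./dfn//text()")).strip()
        # Text nodes below this <p> only, skipping those inside <dfn>.
        rest = p.xpath(".//text()[not(ancestor::dfn)]")
        # Collapse runs of whitespace so each <p> becomes a single line.
        rest = " ".join("".join(rest).split())
        result.append((dfn, rest))
    return result


for dfn, rest in rows(tree):
    print(dfn, rest)

# For comparison: the following:: axis leaks into later paragraphs.
first_p = tree.xpath("//p")[0]
leaked = first_p.xpath("./dfn/following::text()")
```

Here `leaked` also contains the text of the second paragraph, which is the source of the repeated output, while `rows` yields exactly one entry per p.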