
Why does Python lxml etree XPath return more than one element?


I am using lxml.etree in Python 3.

My XPath expression looks like this, and it finds the elements I am looking for in my XHTML:

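# tree is the lxml ElementTree parsed from the XHTML file
root = tree.getroot()
# Prefix-to-namespace-URI mapping used by the XPath query
ns = {'epub': 'http://www.idpf.org/2007/ops', 'm': 'http://www.w3.org/1998/Math/MathML'}
mathML_elements = tree.xpath(".//m:math", namespaces=ns)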

A sample of the parsed XHTML looks like this:

</td><td><m:math xmlns:m="http://www.w3.org/1998/Math/MathML" alttext="-500"><m:mrow><m:mo>-</m:mo><m:mn>500</m:mn></m:mrow></m:math></td><td>0</td></tr><tr><td>8</td><td>Betalt renter på lånet</td><td>413</td><td></td><td>+</td><td><m:math xmlns:m="http://www.w3.org/1998/Math/MathML" alttext="-413"><m:mrow><m:mo>-</m:mo><m:mn>413</m:mn></m:mrow></m:math></td><td>=</td><td><m:math xmlns:m="http://www.w3.org/1998/Math/MathML" alttext="-413"><m:mrow><m:mo>-</m:mo><m:mn>413</m:mn></m:mrow></m:math></td><td>+</td><td></td><td>0</td></tr><tr><td>9</td><td>Avskrivning av pc og inventar</td><td>300</td><td><m:math xmlns:m="http://www.w3.org/1998/Math/MathML" alttext="-300"><m:mrow><m:mo>-</m:mo><m:mn>300</m:mn></m:mrow></m:math></td><td>+</td><td></td><td>=</td><td><m:math xmlns:m="http://www.w3.org/1998/Math/MathML" alttext="-300"><m:mrow><m:mo>-</m:mo><m:mn>300</m:mn></m:mrow></m:math></td><td>+</td><td></td><td>0</td></tr><tr><td>10</td><td>Uttak eier privat</td><td><m:math xmlns:m="http://www.w3.org/1998/Math/MathML" alttext="-14 000"><m:mrow><m:mo>-</m:mo><m:mn>14 000</m:mn></m:mrow></m:math></td><td></td><td>+</td><td><m:math xmlns:m="http://www.w3.org/1998/Math/MathML" alttext="-14 000"><m:mrow><m:mo>-</m:mo><m:mn>14 000</m:mn></m:mrow></m:math></td><td></td><td><m:math xmlns:m="http://www.w3.org/1998/Math/MathML" alttext="-14 000"><m:mrow><m:mo>-</m:mo><m:mn>14 000</m:mn></m:mrow></m:math></td><td>+</td><td></td><td>0</td></tr><tr><td></td><td>Balansekontoer</td><td></td><td>29 700</td><td>+</td><td>122 680</td><td>=</td><td>103 080</td><td>+</td><td>49 500</td><td>0</td></tr><tr><td></td><td>Balansesum</td><td></td><td></td><td></td><td>152 080</td><td>=</td><td>152 080</td><td></td><td></td><td>0</td></tr></tbody></table>
<p>Vi ser at Trine Dals egenkapital har økt med <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" alttext="kr 1037 (kr 103080 - 102043)"><m:mrow><m:mi>kr </m:mi><m:mn>1 037</m:mn><m:mo>⁡</m:mo><m:mfenced><m:mrow><m:mi>kr </m:mi><m:mn>103 080</m:mn><m:mo>-</m:mo><m:mn>102 043</m:mn></m:mrow></m:mfenced></m:mrow></m:math>. Det betyr at det egentlige resultatet av driften denne måneden må være <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" alttext="kr 1037 + kr 14000 = kr 15037"><m:mrow><m:mi>kr </m:mi><m:mn>1 037</m:mn><m:mo>+</m:mo><m:mi>kr </m:mi><m:mn>14 000</m:mn><m:mo>=</m:mo><m:mi>kr </m:mi><m:mn>15 037</m:mn></m:mrow></m:math>. Vi viser for øvrig til resultatregnskapet i neste avsnitt.</p>
<p>✐ <strong>Oppgave 1-1 og 1-2, side 229.</strong></p>

My problem is that some of the elements also contain extra text at the end, as in this node returned by the XPath query:

<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" alttext="kr 1037 + kr 14000 = kr 15037"><m:mrow><m:mi>kr </m:mi><m:mn>1 037</m:mn><m:mo>+</m:mo><m:mi>kr </m:mi><m:mn>14 000</m:mn><m:mo>=</m:mo><m:mi>kr </m:mi><m:mn>15 037</m:mn></m:mrow></m:math>. Vi viser for øvrig til resultatregnskapet i neste avsnitt.

I only want the m:math element. What am I doing wrong?


Solution

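  • That extra text is the .tail property of the _Element. In lxml, text that follows an element's closing tag (up to the next tag) is stored on that element as .tail, so your XPath is returning only the m:math element; the trailing text is simply attached to it.

    How you handle the tail depends on what you want to do with the element.

    For example, if you're using tostring() to serialize the element, you can pass with_tail=False to leave the tail out of the serialization.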