I am trying to process a bunch of XML files and add certain attributes to specific elements if certain conditions are met. I have different versions of the same XML documents. Some of them contain tags that provide formatting information. Here's an example:
<tok id="w-47692" ord="96" lemma="bové" xpos="VMIP3S0">Bové</tok>
<space/>
<add>
<tok form="Bernadó" id="w-47693" ord="97" lemma="bernadó" xpos="NP00000">Bernadó</tok>
</add>
<add>
<tok id="w-47694" ord="98" lemma="mayor" xpos="NCMS000">Mayor</tok>
</add>
<tok id="w-47695" ord="99" lemma="ferran" xpos="NP00000">Ferran</tok>
The tag <space/> indicates that the original document, of which the XML document is a representation, had a space. The tags <add> </add> indicate that the contents appearing between these tags were not part of the original manuscript. The other versions of the XML documents have been cleaned of these formatting elements:
<tok form="Bernadó" id="w-47693" ord="97" lemma="bernadó" xpos="NP00000">Bernadó</tok>
<tok id="w-47694" ord="98" lemma="mayor" xpos="NCMS000">Mayor</tok>
<tok id="w-47695" ord="99" lemma="ferran" xpos="NP00000">Ferran</tok>
The problem I'm experiencing is that the script yields different results for the two documents. These are the two outputs, first for the version with the <add> tags and then for the cleaned version:
<add>
<tok form="Bernadó" id="w-47693" ord="97" lemma="bernadó" xpos="NP00000">Bernadó</tok>
</add>
<add>
<tok id="w-47694" ord="98" lemma="mayor" xpos="NCMS000">Mayor</tok>
</add>
<tok form="Bernadó" id="w-47693" ord="97" lemma="bernadó" xpos="NP0000">Bernadó</tok>
<tok form="Mayor" id="w-47694" ord="98" lemma="Major" xpos="NP0000">Mayor</tok>
I'm not including the entire script because it is rather complex, but here's a simplified version of the relevant part:
def fix_matching_toks(tok):
    """
    Fix matching tokens by setting their lemma and xpos attributes.
    """
    preceding_tok = tok.xpath("preceding-sibling::tok[1]")
    if preceding_tok:
        preceding_tok_last_dtok = preceding_tok[0].xpath("./dtok[last()]")
    else:
        preceding_tok_last_dtok = None
    if tok.tag == "tok":
        if preceding_tok and preceding_tok[0].get("xpos") is not None and preceding_tok[0].get("xpos", "").startswith("NP"):
            tok.set("lemma", "Major")
            tok.set("xpos", "NP00000")
It seems to me that the tags surrounding the relevant tok elements are preventing the XPath expression from matching the intended elements. I thought the XPath expression xpath("preceding-sibling::tok[1]") would match the preceding tok element regardless of whether another element intervenes, but it is clear that this is not the case.
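A minimal reproduction of this behaviour, assuming lxml and a hypothetical <s> root element wrapped around the fragment above (attributes are trimmed for brevity, and `preceding_sibling_id` is just an illustrative helper name):

```python
from lxml import etree

# Hypothetical <s> wrapper so that the fragment has a single root element.
xml = """<s>
<tok id="w-47692" xpos="VMIP3S0">Bové</tok>
<space/>
<add><tok id="w-47693" xpos="NP00000">Bernadó</tok></add>
<add><tok id="w-47694" xpos="NCMS000">Mayor</tok></add>
<tok id="w-47695" xpos="NP00000">Ferran</tok>
</s>"""

root = etree.fromstring(xml)
toks = {t.get("id"): t for t in root.iter("tok")}


def preceding_sibling_id(tok):
    """Return the id of the nearest preceding sibling tok, or None."""
    hits = tok.xpath("preceding-sibling::tok[1]")
    return hits[0].get("id") if hits else None


# This tok is the only child of its <add>, so it has no siblings at all.
print(preceding_sibling_id(toks["w-47694"]))
# This tok sits directly under <s>; sibling <add> and <space/> elements are skipped.
print(preceding_sibling_id(toks["w-47695"]))
```

In this sketch the first call finds nothing and the second finds w-47692, which suggests the issue is not intervening elements as such: preceding-sibling only looks at nodes that share the same parent, so a tok wrapped in <add> has no tok siblings to match.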
How can I prevent this from happening? I want all of the XML documents to be processed in the same way independently of whether they contain these intervening "extraneous" elements or not.
I see 2 options in XPath:
Easy one (but maybe slower):
preceding::tok[1]
Complex one (but probably faster):
parent::add/preceding-sibling::*[1][self::add[tok]]/tok
| parent::add/preceding-sibling::*[1][self::tok]
| preceding-sibling::*[1][self::add[tok]]/tok
| preceding-sibling::*[1][self::tok]
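To compare the two expressions, here is a small sketch using lxml. The <s> root element, the trimmed attributes, and the helper name `previous_ids` are assumptions made for illustration and are not part of the original documents:

```python
from lxml import etree

# Hypothetical <s> wrapper so that the fragment has a single root element.
xml = """<s>
<tok id="w-47692" xpos="VMIP3S0">Bové</tok>
<space/>
<add><tok id="w-47693" xpos="NP00000">Bernadó</tok></add>
<add><tok id="w-47694" xpos="NCMS000">Mayor</tok></add>
<tok id="w-47695" xpos="NP00000">Ferran</tok>
</s>"""

EASY = "preceding::tok[1]"
COMPLEX = (
    "parent::add/preceding-sibling::*[1][self::add[tok]]/tok"
    " | parent::add/preceding-sibling::*[1][self::tok]"
    " | preceding-sibling::*[1][self::add[tok]]/tok"
    " | preceding-sibling::*[1][self::tok]"
)


def previous_ids(root, expr):
    """Map each tok id to the ids that expr selects when evaluated from that tok."""
    return {
        tok.get("id"): [hit.get("id") for hit in tok.xpath(expr)]
        for tok in root.iter("tok")
    }


root = etree.fromstring(xml)
easy = previous_ids(root, EASY)
complex_ = previous_ids(root, COMPLEX)

for tok_id in easy:
    print(tok_id, easy[tok_id], complex_[tok_id])
```

On this sample both expressions agree for w-47694 and w-47695, but they differ for w-47693: the complex expression only inspects the element immediately before the tok (or before its <add> parent), so the intervening <space/> stops it, whereas preceding::tok[1] still reaches w-47692. If elements such as <space/> should be skipped as well, the easy expression is probably the safer choice; in fix_matching_toks it would replace "preceding-sibling::tok[1]" directly.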