Search code examples
openrefine

Extract XML attribute value based on element value


I have a few cells in OpenRefine with some XML in it (coming from nominatim) and, for each node, I would like to extract the value of an attribute only if the value of element in the same node is equal to a specific string ('Paris'). I am using jython to loop across the element and, if the element value is equal to Paris, return the desired attribute. Here's the code for it:

from xml.etree import ElementTree as ET
element = ET.fromstring(value).encode('utf8')
root = element.getroot()
resultsList = root.findall(".//place")
for result in resultsList:
    typerecord = result.find("city")
    if typerecord.text == "Paris":
        return result.attrib["lat"]

However it doesn't seems to work, even if the code seems fine to me. I get the following error:

Error: Traceback (most recent call last):
  File "<string>", line 3, in __temp_242115945__
  File "/opt/openrefine/webapp/extensions/jython/module/MOD-INF/lib/jython-standalone-2.7.2.jar/Lib/xml/etree/ElementTree.py", line 1313, in XML
  File "/opt/openrefine/webapp/extensions/jython/module/MOD-INF/lib/jython-standalone-2.7.2.jar/Lib/xml/etree/ElementTree.py", line 1653, in feed
  File "/opt/openrefine/webapp/extensions/jython/module/MOD-INF/lib/jython-standalone-2.7.2.jar/Lib/xml/etree/ElementTree.py", line 1653, in feed
  File "/opt/openrefine/webapp/extensions/jython/module/MOD-INF/lib/jython-standalone-2.7.2.jar/Lib/xml/parsers/expat.py", line 193, in Parse
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 115: ordinal not in range(128)

which appears to be more about the encoding of the characters. I added to the script .encode('utf8') but nothing change.

Here a sample of the XML:

<?xml version="1.0" encoding="UTF-8" ?>
<searchresults attribution="Data © OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright" exclude_place_ids="18482590,103398643,118557459,109798886" more_url="https://nominatim.openstreetmap.org/search/?street=11+rue+Girardon&amp;city=Paris&amp;country=France&amp;addressdetails=1&amp;extratags=1&amp;polygon_geojson=1&amp;exclude_place_ids=18482590%2C103398643%2C118557459%2C109798886&amp;format=xml" querystring="11 rue Girardon, Paris, France" timestamp="Tue, 25 Oct 22 09:32:26 +0000">
    <place address_rank="30" boundingbox="43.6242386,43.6243386,1.4264894,1.4265894" class="place" display_name="11, Rue François Girardon, Minimes - Barrière de Paris, Toulouse Nord, Toulouse, Haute-Garonne, Occitanie, France métropolitaine, 31200, France" geojson="{&quot;type&quot;:&quot;Point&quot;,&quot;coordinates&quot;:[1.4265394,43.6242886]}" importance="0.5201" lat="43.6242886" lon="1.4265394" osm_id="2084506137" osm_type="node" place_id="18482590" place_rank="30" type="house">
        <extratags/>
        <house_number>11</house_number>
        <road>Rue François Girardon</road>
        <neighbourhood>Minimes - Barrière de Paris</neighbourhood>
        <suburb>Toulouse Nord</suburb>
        <city>Toulouse</city>
        <municipality>Toulouse</municipality>
        <county>Haute-Garonne</county>
        <ISO3166-2-lvl6>FR-31</ISO3166-2-lvl6>
        <state>Occitanie</state>
        <ISO3166-2-lvl4>FR-OCC</ISO3166-2-lvl4>
        <region>France métropolitaine</region>
        <postcode>31200</postcode>
        <country>France</country>
        <country_code>fr</country_code>
    </place>
    <place address_rank="26" boundingbox="48.8872626,48.8876471,2.3372233,2.3374922" class="highway" display_name="Rue Girardon, Quartier des Grandes-Carrières, Paris 18e Arrondissement, Paris, Île-de-France, France métropolitaine, 75018, France" geojson="{&quot;type&quot;:&quot;LineString&quot;,&quot;coordinates&quot;:[[2.3372233,48.8872626],[2.3372534,48.8873072],[2.337453,48.8875915],[2.3374922,48.8876471]]}" importance="0.52" lat="48.8875915" lon="2.337453" osm_id="10662867" osm_type="way" place_id="103398643" place_rank="26" type="residential">
        <extratags>
            <tag key="lit" value="yes"/>
            <tag key="surface" value="sett"/>
            <tag key="maxspeed" value="30"/>
            <tag key="sidewalk" value="both"/>
            <tag key="smoothness" value="intermediate"/>
            <tag key="cycleway:both" value="no"/>
            <tag key="zone:maxspeed" value="FR:30"/>
            <tag key="motor_vehicle:conditional" value="no @ (Su,PH 11:00-18:00)"/>
        </extratags>
        <road>Rue Girardon</road>
        <city_block>Quartier des Grandes-Carrières</city_block>
        <suburb>Paris 18e Arrondissement</suburb>
        <city_district>Paris</city_district>
        <city>Paris</city>
        <ISO3166-2-lvl6>FR-75</ISO3166-2-lvl6>
        <state>Île-de-France</state>
        <ISO3166-2-lvl4>FR-IDF</ISO3166-2-lvl4>
        <region>France métropolitaine</region>
        <postcode>75018</postcode>
        <country>France</country>
        <country_code>fr</country_code>
    </place>
    <place address_rank="26" boundingbox="48.8885135,48.8886689,2.3380551,2.3381062" class="highway" display_name="Rue Girardon, Quartier des Grandes-Carrières, Paris 18e Arrondissement, Paris, Île-de-France, France métropolitaine, 75018, France" geojson="{&quot;type&quot;:&quot;LineString&quot;,&quot;coordinates&quot;:[[2.3381062,48.8885135],[2.3380648,48.8886091],[2.3380551,48.8886689]]}" importance="0.52" lat="48.8886091" lon="2.3380648" osm_id="23371363" osm_type="way" place_id="109798886" place_rank="26" type="pedestrian">
        <extratags>
            <tag key="lit" value="yes"/>
            <tag key="surface" value="paving_stones"/>
            <tag key="smoothness" value="good"/>
        </extratags>
        <road>Rue Girardon</road>
        <city_block>Quartier des Grandes-Carrières</city_block>
        <suburb>Paris 18e Arrondissement</suburb>
        <city_district>Paris</city_district>
        <city>Paris</city>
        <ISO3166-2-lvl6>FR-75</ISO3166-2-lvl6>
        <state>Île-de-France</state>
        <ISO3166-2-lvl4>FR-IDF</ISO3166-2-lvl4>
        <region>France métropolitaine</region>
        <postcode>75018</postcode>
        <country>France</country>
        <country_code>fr</country_code>
    </place>
</searchresults>

Given the sample and the code, what I am expecting as a result is:

48.8875915
48.8886091

anyone who can help or suggest some GREL alternative for it?


Solution

  • Personally, I find non-trivial Python to be a royal pain to debug in OpenRefine's Jython preview, because GREL's fluent style is much easier to build up incrementally, so here's a GREL equivalent for your Python:

    forEach(value.parseXml().select('place'),p,if(p.select('city')[0].htmlText()=='Paris',p.htmlAttr('lat'),None)).join('|')
    

    It returns 48.8875915|48.8886091 (you can't store an array in a cell)

    Having said that, there are two problems with your Python:

    • you need to encode the string, not the value returned from fromstring(), ie ET.fromstring(value.encode('utf8')) not ET.fromstring(value).encode('utf8')
    • ElementTree.fromstring() returns the root element directly, so the getroot() is unnecessary.

    The patched up code is below, but note that it only returns the first value. It would need additional modifications to return all matches concatenated together in a string.

    from xml.etree import ElementTree as ET
    root = ET.fromstring(value.encode('utf8'))
    resultsList = root.findall(".//place")
    for result in resultsList:
        typerecord = result.find("city")
        if typerecord.text == "Paris":
            return result.attrib["lat"]