
Extracting skos:closeMatch from RDF/XML using GREL in OpenRefine


I need to extract every skos:closeMatch URI from an RDF/XML column into a separate column in OpenRefine.

This is my RDF/XML code:

<rdf:RDF xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/1999/02/22-rdf-schema#" xmlns:cs="http://purl.org/vocab/changeset/schema#" xmlns:skosxl="http://www.w3.org/2008/05/skos-xl#">
  <rdf:Description rdf:about="http://id.loc.gov/authorities/subjects/sh85145648">
    <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept"/>
    <skos:prefLabel xml:lang="en">Water-supply</skos:prefLabel>
    <skosxl:altLabel>
      <rdf:Description>
        <rdf:type rdf:resource="http://www.w3.org/2008/05/skos-xl#Label"/>
        <skosxl:literalForm xml:lang="en">Availability, Water</skosxl:literalForm>
      </rdf:Description>
    </skosxl:altLabel>
    <skosxl:altLabel>
      <rdf:Description>
        <rdf:type rdf:resource="http://www.w3.org/2008/05/skos-xl#Label"/>
        <skosxl:literalForm xml:lang="en">Water availability</skosxl:literalForm>
      </rdf:Description>
    </skosxl:altLabel>
    <skosxl:altLabel>
      <rdf:Description>
        <rdf:type rdf:resource="http://www.w3.org/2008/05/skos-xl#Label"/>
        <skosxl:literalForm xml:lang="en">Water resources</skosxl:literalForm>
      </rdf:Description>
    </skosxl:altLabel>
    <skos:closeMatch rdf:resource="http://www.yso.fi/onto/yso/p9967"/>
    <skos:closeMatch rdf:resource="http://id.worldcat.org/fast/1172350"/>
    <skos:closeMatch rdf:resource="http://www.wikidata.org/entity/Q1061108"/>
    <skos:closeMatch rdf:resource="http://id.worldcat.org/fast/1172350"/>
    <skos:closeMatch rdf:resource="http://www.wikidata.org/entity/Q1061108"/>
    <skos:closeMatch rdf:resource="http://www.yso.fi/onto/yso/p9967"/>
    <skos:changeNote>
      <cs:ChangeSet>
        <cs:subjectOfChange rdf:resource="http://id.loc.gov/authorities/subjects/sh85145648"/>
        <cs:creatorName rdf:resource="http://id.loc.gov/vocabulary/organizations/dlc"/>
        <cs:createdDate rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">1986-02-11T00:00:00</cs:createdDate>
        <cs:changeReason rdf:datatype="http://www.w3.org/2001/XMLSchema#string">new</cs:changeReason>
      </cs:ChangeSet>
    </skos:changeNote>
    <skos:changeNote>
      <cs:ChangeSet>
        <cs:subjectOfChange rdf:resource="http://id.loc.gov/authorities/subjects/sh85145648"/>
        <cs:creatorName rdf:resource="http://id.loc.gov/vocabulary/organizations/dlc"/>
        <cs:createdDate rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2016-11-17T07:36:37</cs:createdDate>
        <cs:changeReason rdf:datatype="http://www.w3.org/2001/XMLSchema#string">revised</cs:changeReason>
      </cs:ChangeSet>
    </skos:changeNote>
  </rdf:Description>
</rdf:RDF>

I tried the expression value.parseHtml().select('skos|closematch') to add a column based on the RDF/XML column, but it doesn't work.


Solution

  • Your code is pretty close. Were you examining the display of the preview column to help guide you?

    Your code returns an array of six XML elements. The things that you're missing are:

    • an iterator - forEach()
    • a function to fetch the value of the attribute - htmlAttr()
    • something to convert the array to a single value which can be stored in the column - join()

    Altogether it'll look like: forEach(value.parseHtml().select('skos|closeMatch'), element, element.htmlAttr('rdf:resource')).join('|')

    I built this from the inside out. I started with a single element, value.parseHtml().select('skos|closeMatch')[0], to see what it looked like, then added .htmlAttr('rdf:resource'), and finally wrapped the whole thing in forEach(...).join('|'). You can choose whatever delimiter you find most useful.

    Update: your data has duplicates, so you might want to add .uniques() like:

    forEach(value.parseHtml().select('skos|closeMatch'), element, element.htmlAttr('rdf:resource')).uniques().join('|')
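
If it helps to check the expected result outside OpenRefine, the same select / read attribute / de-duplicate / join steps can be sketched in plain Python with the standard library. This is only an illustration, not part of the GREL solution: the function name close_matches and the trimmed SAMPLE record are made up for the example, and it uses a namespace-aware XML parser rather than the HTML parser that parseHtml() uses.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the xmlns declarations in the question's RDF/XML.
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Trimmed copy of the question's record (hypothetical test fixture).
SAMPLE = """<rdf:RDF xmlns:skos="http://www.w3.org/2004/02/skos/core#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://id.loc.gov/authorities/subjects/sh85145648">
    <skos:closeMatch rdf:resource="http://www.yso.fi/onto/yso/p9967"/>
    <skos:closeMatch rdf:resource="http://id.worldcat.org/fast/1172350"/>
    <skos:closeMatch rdf:resource="http://www.wikidata.org/entity/Q1061108"/>
    <skos:closeMatch rdf:resource="http://id.worldcat.org/fast/1172350"/>
    <skos:closeMatch rdf:resource="http://www.wikidata.org/entity/Q1061108"/>
    <skos:closeMatch rdf:resource="http://www.yso.fi/onto/yso/p9967"/>
  </rdf:Description>
</rdf:RDF>"""


def close_matches(rdf_xml, delimiter="|"):
    """Return the unique skos:closeMatch URIs joined into one string.

    Hypothetical helper mirroring the GREL steps: select the elements,
    read rdf:resource from each, drop duplicates, join.
    """
    root = ET.fromstring(rdf_xml)
    # ElementTree spells qualified names as {namespace-uri}localname.
    elements = root.iter("{%s}closeMatch" % SKOS)
    uris = [el.get("{%s}resource" % RDF) for el in elements]
    # Skip any closeMatch element that lacks an rdf:resource attribute.
    uris = [uri for uri in uris if uri is not None]
    # dict.fromkeys keeps first-seen order while removing duplicates.
    return delimiter.join(dict.fromkeys(uris))


if __name__ == "__main__":
    print(close_matches(SAMPLE))
```

For the sample record this yields the three distinct URIs in first-seen order, separated by the chosen delimiter.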